PATENT 
REMARKS 

The Office Action in the above-identified application has been carefully considered and 
this amendment has been presented to place this application in condition for allowance. 
Accordingly, reexamination and reconsideration of this application are respectfully requested. 



Claims 1, 4-14, 16, and 19-29 are in the present application. It is submitted that these 
claims were patentably distinct over the prior art cited by the Examiner, and that these claims 
were in full compliance with the requirements of 35 U.S.C. § 112. The changes to the claims, as 
presented herein, are not made for the purpose of patentability within the meaning of 35 U.S.C. 
sections 101, 102, 103 or 112. Rather, these changes are made simply for clarification and to 
round out the scope of protection to which Applicant is entitled. 



In response to the Examiner's request, enclosed is a copy of the article by S. S. Pietrobon, 
"Implementation and performance of a turbo/MAP decoder," Int. J. Satellite Commun., Vol. 16, 
pp. 23-46, Jan-Feb 1998. 

Claims 1, 4-14, 16, and 19-29 were rejected under 35 U.S.C. § 112, second paragraph, as 
being incomplete for omitting essential structural cooperative relationships between elements. In 
response, Applicant has deleted the offending limitation "the input is digital information encoded 
as convolutional codes." Accordingly, Applicant believes this rejection has been overcome. 

Claims 1, 4-14, 16, and 19-29 were rejected under 35 U.S.C. § 101 because the claimed 
invention is directed to non-statutory subject matter. Specifically, the Examiner takes issue with 
"a linear approximation means" implementing a mathematical algorithm. (Office Action pages 
2-3) Although Applicant still disagrees with the Examiner's position and reasserts the prior 
arguments, Applicant has amended independent claims 1 and 16 to directly tie the limitations to 
specific hardware elements in the decoder which perform the novel correction term computation 
portion of a log likelihood calculation used in decoding the encoded input signal. Specifically, 
the present claims are directed to "a decoder for decoding digital information from an encoded 
input signal received over a communication channel." (Claim 1, Claim 16 contains similar 
limitations) This limitation relates directly to the hardware system elements (decoder 3 and 
memoryless communication channel 2) shown in Figure 6 and described on page 20 of the 
specification. Moreover, the claims now recite "an absolute value computation circuit for 
calculating a variable based on the absolute value of said received value." (Claim 1, Claim 16 
contains similar limitations) As shown in Figure 14, the absolute value computation circuit 67 
and the linear approximation circuit 68 (i.e. the claimed linear approximation means) make up 
the correction term computation circuit 65 which occurs within several different components in 
the decoder 3. Accordingly, the recited limitations now clearly describe functions performed by 
discrete hardware elements within the decoder. For this reason, Applicant believes the present 
invention is clearly statutory subject matter as the invention is hardware/software which 
practically applies a mathematical algorithm to produce a tangible result (decoding digital 
information from an encoded input signal). 

In rebuttal to the Examiner's Response to Arguments, the Examiner states that unlike 
digital data, a digital signal requires hardware. Applicant does not understand the distinction 
being made, since presumably both are comprised of bits (1s and 0s) which cannot exist without 
a hardware construct. Secondly, the valid example provided in the MPEP as a statutory process 
does not recite any hardware. Thirdly, the Examiner states that data structures are non-statutory. 
However, an accompanying mention of a hardware element (e.g. a memory) has been found to 
make a data structure statutory. Presumably, the present invention's "decoder" would suffice as 
such a hardware element to make the decoded data statutory. 

Therefore, Applicant believes the amended claims constitute statutory subject matter and 
respectfully requests this rejection be withdrawn. 

In view of the foregoing amendment and remarks, it is respectfully submitted that the 
application as now presented is in condition for allowance. Early and favorable reconsideration 
of the application is respectfully requested. 

No fees are deemed to be required for the filing of this amendment, but if such are, the 
Examiner is hereby authorized to charge any insufficient fees or credit any overpayment 
associated with the above-identified application to Deposit Account No. 50-0320. 

If any issues remain, or if the Examiner has any further suggestions, he/she is invited to 
call the undersigned at the telephone number provided below. The Examiner's consideration of 
this matter is gratefully acknowledged. 




Respectfully submitted, 
FROMMER LAWRENCE & HAUG LLP 



Darren M. Simon 
Reg. No. 47,946 
(212) 588-0800 

IMPLEMENTATION AND PERFORMANCE OF A 
TURBO/MAP DECODER 

STEVEN S. PIETROBON* 
Small World Communications, 6 First Avenue, Payneham South, SA 5070, Australia 



SUMMARY 

The implementation and performance of a turbo/MAP decoder are described. A serial block MAP 
decoder operating in the logarithm domain is used to obtain a very-high-performance turbo decoder. 
Programmable gate arrays and EPROMs allow the decoder to be programmed for almost any code 
from four to 512 states, rate 1/3 to rate 1/7 (higher rates are achieved with puncturing) and interleaver 
block sizes to 65,536 bits. Seven decoding stages were implemented in parallel. For rate 1/3 and 1/7 
16-state codes with an interleaver size of 65,536 bits and operating at up to 356 kbit/s the codec 
achieved an $E_b/N_0$ of 0.32 and -0.30 dB respectively for a BER of $10^{-5}$. BERs down to $10^{-7}$ were 
also achieved for a small increase in $E_b/N_0$. An efficient implementation of a continuous MAP decoder 
is also presented, along with a synchronization technique for turbo decoders. © 1998 John Wiley & 
Sons, Ltd. 

KEY WORDS: turbo coding; MAP decoding; synchronization 



1. INTRODUCTION 

Error control coding aims to correct errors caused by 
noise and interference in a digital communications 
scheme. For power-limited schemes the ratio of 
energy per bit to single-sided noise density ($E_b/N_0$) 
is desired to be as low as possible. Good examples 
of this are satellite and space communications, where 
fairly low bandwidth efficiencies ($K$, in bits transmitted 
per signalling interval or bit/sym) of 0.1-2 bit/sym 
are used. 

Coding adds redundancy using special codes. A 
decoder will then use this redundant information to 
correct as many errors as possible. Shannon¹ showed 
that for an additive white Gaussian noise channel 
the smallest $E_b/N_0$ that can be obtained for reliable 
transmission is 

$$\frac{E_b}{N_0} > \frac{2^K - 1}{K} \tag{1}$$

As $K$ approaches zero or the required bandwidth 
approaches infinity, the smallest value of $E_b/N_0$ is 
$\ln 2$ or approximately -1.59 dB. A typical scheme 
such as uncoded quadrature phase shift keying 
(QPSK) with $K = 2$ bit/sym requires an $E_b/N_0$ of 
9.6 dB for a bit error ratio (BER) of $10^{-5}$. Figure 1 
plots $K$ versus $E_b/N_0$, showing the Shannon capacity 
curve from (1) and the capacity curve when QPSK 
modulation is used.² Also plotted are the performances 
of uncoded QPSK and some other coding 
schemes at a BER less than or equal to $10^{-3}$. Note 
that the Voyager and Galileo schemes use binary 
phase shift keying (BPSK), which would result in 
their bandwidth efficiencies being reduced by half. 
However, since Gray-mapped QPSK can achieve the 
same performance as BPSK with twice the bandwidth 
efficiency, we plot the Voyager and Galileo 
schemes using QPSK. 
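
The capacity limits quoted in this introduction can be checked numerically. The short Python sketch below assumes the standard AWGN bound $E_b/N_0 > (2^K - 1)/K$ for two-dimensional signalling, which agrees with the -1.59 dB limit stated above as $K$ approaches zero; the function name `shannon_limit_db` is ours and is used only for illustration.

```python
import math

def shannon_limit_db(k_bits_per_sym: float) -> float:
    """Smallest Eb/N0 (in dB) for reliable transmission at K bit/sym,
    assuming the bound Eb/N0 > (2**K - 1) / K."""
    ratio = (2.0 ** k_bits_per_sym - 1.0) / k_bits_per_sym
    return 10.0 * math.log10(ratio)

# Bandwidth efficiencies mentioned in the text, plus a value close to zero
for k in (1e-6, 0.438, 0.875, 1.0, 2.0):
    print(f"K = {k:8.6f} bit/sym -> Eb/N0 > {shannon_limit_db(k):6.2f} dB")
```

Subtracting these limits from the operating points quoted in the text gives the distance of each scheme from capacity.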

The industry standard code for satellite communications 
is a rate 1/2, 64-state, non-systematic convolutional 
code³ with a soft-decision Viterbi decoder. 
This code can achieve an $E_b/N_0$ of 4.2 dB at a BER 
of $10^{-5}$, giving a 5.4 dB coding gain. In practice, 
the use of 3 bit soft decisions and other quantization 
effects in the decoder results in a 0.2 dB performance 
loss. Another 0.2 dB may be lost owing to the 
use of differential encoding to resolve 180° phase 
ambiguities in the QPSK signal set. 

To obtain better performance, the standard code 
has been concatenated with a Reed-Solomon (RS) 
outer code. The most famous example is the 
(255,223) GF($2^8$) RS code with depth eight 
interleaving used on the Voyager space probes.³,⁴ 
This scheme can achieve an $E_b/N_0$ of 2.53 dB at a 
BER of $10^{-6}$ and $K = 0.875$ bit/sym. A more 
advanced scheme uses a rate 1/4, 8192-state, non-systematic 
convolutional inner code and a time-varying, 
depth eight GF($2^8$) RS outer code with 
redundancy profile (94,10,30,10,60,10,30,10), giving 
a (255,223.25) code on average.³ Four stages of 
iterative decoding are used to obtain an $E_b/N_0$ of 
0.58 dB at a BER of $10^{-7}$ and $K = 0.438$ bit/sym. 
This is the most powerful code currently in use and 
is 1.5 dB from Shannon capacity at 
$K = 0.438$ bit/sym. 

In 1993, Berrou et al. published a paper describing 
a new coding scheme called 'turbo-codes'.⁶ A 
rate 1/2 code is described that achieved the amazing 
performance of $E_b/N_0 = 0.7$ dB at a BER of $10^{-5}$. 
This is only 0.7 and 0.5 dB from Shannon and 
QPSK capacity respectively at $K = 1$ bit/sym. The 
encoder consists of two parallel concatenated systematic 
convolutional encoders separated by a random 
interleaver. The code in Reference 6 used two 
punctured 16-state codes and a 65,536 bit interleaver. 
This code is shown as TC1 in Figure 1. 

Decoding is performed iteratively. Each systematic 
code is decoded using a soft-in soft-out (SISO) 
decoder. The output of the first decoder feeds into 
the second decoder to form one turbo decoder iteration. 
Eighteen iterations were performed in Reference 6. 
The SISO decoder used in Reference 6 is 
a modification of the maximum a posteriori (MAP) 
decoding algorithm.⁷ A similar iterative decoding 
technique is described by Gallager in his 1962 paper 
on low-density parity check codes.⁸ Each parity bit 
together with its associated checked information bits 
is treated as a single $(k+1,k)$ block code. A simple 
soft-output MAP decoding algorithm is used for 
each of the parity bits. A second SISO MAP decoder 
is then used to decode the information bit associated 
with the $j$ parity check bits. The process then 
repeats. 

The MAP algorithm finds the most likely information 
bit to have been transmitted in a coded 
sequence. This is unlike the Viterbi algorithm,⁹ 
which finds the most likely sequence to have been 
transmitted. When the decoded BER is small, there 
is a negligible error performance difference between 
the MAP and Viterbi algorithms. Since the MAP 
algorithm is considerably more complex than the 
Viterbi algorithm, it has thus been largely ignored. 
However, at low $E_b/N_0$ and high BERs, MAP can 
outperform soft-output Viterbi by 0.5 dB or more. 
For turbo codes this is very important, since the 
output BERs from the first stages of iterative decoding 
can be very high. Thus any improvement that 
can be obtained at these high BERs will directly 
result in performance increases. 

A practical application of turbo/MAP decoders 
for satellite communications is given in Reference 
10. Various turbo coding schemes were investigated 
to achieve 2 bit/sym bandwidth efficiency over a 
high-speed mobile satellite link. Using a rate 1/2 
turbo code with a 4000 bit block size, eight iterations 
and 16QAM modulation, an $E_b/N_0$ of 3.25 dB could 
be achieved over an ideal AWGN channel for a 
BER of $10^{-5}$. This is only 1.5 dB from Shannon 
capacity at 2 bit/sym. This code is shown as 
(2,1,4;12,8,M)16QAM in Figure 1. The notation 
$(n,k,v;m,I,j)$ is used to describe a turbo code, with 
$n$ the number of coded bits, $k$ the number of information 
bits (the code rate $R = k/n$), $v$ the memory 
of individual encoders (the number of states is equal 
to $2^v$), $m = \log_2 N_I$ (where $N_I$ is the interleaver size), 
$I$ the number of decoder iterations and $j$ = M or V 
indicating MAP or SOVA decoders respectively. 
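
To make the notation concrete, the following Python sketch unpacks a six-field turbo code descriptor into the derived quantities named above (rate, number of states, interleaver size). The class name `TurboCode` and its field names are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TurboCode:
    """Hypothetical container for the six-field turbo code notation."""
    n: int             # number of coded bits
    k: int             # number of information bits
    v: int             # memory of each constituent encoder
    m: int             # log2 of the interleaver size
    iterations: float  # number of decoder iterations
    decoder: str       # "M" for MAP, "V" for SOVA

    @property
    def rate(self) -> float:
        return self.k / self.n

    @property
    def states(self) -> int:
        return 2 ** self.v

    @property
    def interleaver_size(self) -> int:
        return 2 ** self.m

    @property
    def decoder_name(self) -> str:
        return {"M": "MAP", "V": "SOVA"}[self.decoder]

# The (2,1,4;12,8,M) code discussed in the text
code = TurboCode(n=2, k=1, v=4, m=12, iterations=8, decoder="M")
print(code.rate, code.states, code.interleaver_size, code.decoder_name)
```

For this descriptor the interleaver size is a power of two, so the 4000 bit block size quoted in the text fits within it.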

The MAP algorithm does not have to be used in 
a turbo decoder. The simpler soft-output Viterbi 
algorithm (SOVA)¹¹⁻¹⁴ can also be used. However, 
since the output of SOVA does not provide as 
good a statistic as MAP and makes more errors, 
degradations of about 0.8 dB can be expected in 
overall performance.¹⁵ SOVA has been used in Reference 16 
to implement a turbo decoder on a chip. 
Two punctured eight-state codes were used to obtain 
a rate 1/2 turbo code which achieves an $E_b/N_0$ of 
2.6 dB at a BER of $10^{-5}$. This is almost the same 
performance as the more complicated Voyager code! 
The interleaver size is 1024 bits and 2.5 iterations 
(five SOVA decoders) were used. The code is 
shown as TC3 in Figure 1. 

A single iteration on a chip was implemented in 
Reference 17. This chip consists of two 16-state SOVA 
decoders and a 2048 bit interleaver/deinterleaver. A 
performance of $E_b/N_0 = 1.7$ dB at a BER of $10^{-5}$ is 
achieved after five iterations for a rate 1/2 turbo code. 
This is TC2 in Figure 1. 

An obstacle to implementing the code in Reference 6 
is its sheer complexity. The MAP algorithm 
described in Reference 6 is very complex and is 
not very amenable to hardware implementation. By 
operating the algorithm in the logarithm domain,¹⁸,¹⁹ 
the complexity can be greatly reduced. In this paper 
we describe the implementation and performance of 
a turbo/MAP decoder which can implement the code 
in Reference 6. The MAP decoder was designed to 
be very flexible and can be programmed to operate 
from four to 512 states and from rate 1/2 to rate 
1/4. Punctured codes can also be implemented. The 
decoding speed ranges from 17.7 kbit/s for 512 
states to 624.4 kbit/s for four states. A 16-state code 
operates at 356.8 kbit/s. 

We first give the derivation of the MAP, log-MAP 
and sub-MAP decoding algorithms. This is 
followed by a description of an implementation of 
the log-MAP algorithm. We then describe the iterative 
turbo decoding algorithm and give a description 
of its implementation. Actual performance curves of 
the decoder are presented, followed by some comments 
on continuous decoding and synchronization. 

2. THE MAP, LOG-MAP AND SUB-MAP 
ALGORITHMS 

We present here a full derivation of the MAP 
decoding algorithm for systematic convolutional 
codes on an additive white Gaussian noise (AWGN) 
channel. The derivation is similar to that in Reference 18, 
with the final presentation of the algorithm 
being slightly simpler than the traditional presentations. 

2.1. The MAP Decoding Algorithm 

The origin of the MAP algorithm belongs to 
Chang and Hancock,²⁰ who developed it to minimize 
the symbol (or bit) error probability for an intersymbol 
interference (ISI) channel. Simultaneously, Bahl 
et al.²¹ and McAdam et al.²² developed the algorithm 
for use on coded channels. The MAP algorithm 
that was presented in Reference 6 for systematic 
convolutional codes is very complicated. A simplified 
version of this algorithm is given in Reference 
18. However, Berrou and Glavieux changed over to 
the more traditional presentation¹³,²³ of the algorithm 
in Reference 24. 

For an encoder with $v$ memory cells we define 
the encoder state at time $k$, $S_k$, as a $v$-tuple, 
depending only on the output of each delay element. 
The information bit at time $k$, $d_k$, is associated with 
the transition from time $k$ to time $k+1$ and will 
change the encoder state from $S_k$ to $S_{k+1}$. Also 
suppose that the information bit sequence $\{d_k\}$ is 
made up of $N - v$ independent bits $d_k$, taking values 
zero and one with a priori probability (APrP) $\zeta_k^0$ 
and $\zeta_k^1$ respectively ($\zeta_k^0 + \zeta_k^1 = 1$). We let the encoder 
initial state $S_1$ be equal to zero. The last $v$ information 
bits ($d_{N-v+1}$ to $d_N$) are set to values that will 
force the state to zero at time $N+1$ (i.e. $S_{N+1} = 0$). 
This will slightly reduce the rate of the encoder. 

We consider a rate 1/2 systematic feedback encoder 
whose outputs at time $k$ are the uncoded data 
bit $d_k$ and the coded bit $c_k$. These outputs are 
modulated with a BPSK or QPSK modulator and 
sent through an AWGN channel. At the receiver 
end we define the received sequence 

$$R_1^N = (R_1, \ldots, R_k, \ldots, R_N) \tag{2}$$

where $R_k = (x_k, y_k)$ is the received symbol at time $k$; 
$x_k$ and $y_k$ are defined as 

$$x_k = (2d_k - 1) + p_k \tag{3}$$

$$y_k = (2c_k - 1) + q_k \tag{4}$$

with $p_k$ and $q_k$ being two independent, normally 
distributed random variables with variance $\sigma^2$. We 
define the likelihood ratio $\lambda_k$ associated with each 
decoded bit $d_k$ as 

$$\lambda_k = \frac{\Pr(d_k = 0 \mid R_1^N)}{\Pr(d_k = 1 \mid R_1^N)} \tag{5}$$

where $\Pr(d_k = i \mid R_1^N)$, $i = 0,1$, is the a posteriori probability 
(APoP) of the data bit $d_k$. The APoP of a 
decoded data bit $d_k$ can be derived from the joint 
probability defined by 

$$\lambda_k^{i,m} = \Pr(d_k = i, S_k = m \mid R_1^N) \tag{6}$$

and thus the APoP of a decoded data bit $d_k$ is 
equal to 

$$\Pr(d_k = i \mid R_1^N) = \sum_m \lambda_k^{i,m} \tag{7}$$

where $i = 0,1$ and the summation is over all $2^v$ 
encoder states. From (5) and (7) the $\lambda_k$ associated 
with a decoded bit $d_k$ can be written as 

$$\lambda_k = \frac{\sum_m \lambda_k^{0,m}}{\sum_m \lambda_k^{1,m}} \tag{8}$$

The decoder can make a decision by comparing $\lambda_k$ 
with a threshold equal to one. 

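Using Bayes' rule, the joint probability from (6) 
can be rewritten as 

$$\begin{aligned}
\lambda_k^{i,m} &= \Pr(R_1^{k-1} \mid d_k = i, S_k = m, R_k^N) \\
&\quad \times \Pr(R_{k+1}^N \mid d_k = i, S_k = m, R_k) \\
&\quad \times \Pr(d_k = i, S_k = m, R_k)/\Pr(R_1^N)
\end{aligned} \tag{10}$$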

We have 

$$\Pr(R_1^{k-1} \mid d_k = i, S_k = m, R_k^N) = \Pr(R_1^{k-1} \mid S_k = m) = \alpha_k^m \tag{11}$$

since the assumption that $S_k = m$ implies that events 
before time $k$ are not influenced by observations 
after time $k$. We define $\alpha_k^m$ as the forward state 
metric at time $k$ and state $m$. Similarly, we have 

$$\Pr(R_{k+1}^N \mid d_k = i, S_k = m, R_k) = \Pr(R_{k+1}^N \mid S_{k+1} = f(i,m)) = \beta_{k+1}^{f(i,m)} \tag{12}$$

where $f(i,m)$ is the next state given an input $i$ and 
state $m$. We define $\beta_k^m$ as the reverse state metric 
at time $k$ and state $m$. We define the branch metric as 

$$\delta_k^{i,m} = \Pr(d_k = i, S_k = m, R_k) \tag{13}$$

Substituting (11)-(13) into (10), we obtain 

$$\lambda_k^{i,m} = \alpha_k^m \delta_k^{i,m} \beta_{k+1}^{f(i,m)}/\Pr(R_1^N) \tag{14}$$

This result can be used to evaluate (8) as 

$$\lambda_k = \frac{\sum_m \alpha_k^m \delta_k^{0,m} \beta_{k+1}^{f(0,m)}}{\sum_m \alpha_k^m \delta_k^{1,m} \beta_{k+1}^{f(1,m)}} \tag{15}$$

where the summations are over all $2^v$ states. The 
usual expression for (15) involves a double summation 
in both the numerator and denominator. 

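We want to show that (11) can be recursively 
calculated. We can express (11) as 

$$\begin{aligned}
\alpha_k^m &= \Pr(R_1^{k-1} \mid S_k = m) \\
&= \sum_{m'} \sum_{j=0}^{1} \Pr(d_{k-1} = j, S_{k-1} = m', R_1^{k-1} \mid S_k = m) \\
&= \sum_{m'} \sum_{j=0}^{1} \Pr(R_1^{k-2} \mid S_k = m, d_{k-1} = j, S_{k-1} = m', R_{k-1}) \\
&\qquad \times \Pr(d_{k-1} = j, S_{k-1} = m', R_{k-1} \mid S_k = m) \\
&= \sum_{j=0}^{1} \Pr(R_1^{k-2} \mid S_{k-1} = b(j,m)) \\
&\qquad \times \Pr(d_{k-1} = j, S_{k-1} = b(j,m), R_{k-1}) \\
&= \sum_{j=0}^{1} \alpha_{k-1}^{b(j,m)} \delta_{k-1}^{j,b(j,m)}
\end{aligned} \tag{16}$$

where the first summation is from $m' = 0$ to $2^v - 1$ 
and $b(j,m)$ is the state going backwards in time 
from state $m$ on the previous branch corresponding 
to input $j$. In a similar way we can recursively 
calculate the probability $\beta_k^m$ from the probability 
$\beta_{k+1}^m$. Note that this is possible only after the whole 
block of data is received. Relation (12) becomes 

$$\begin{aligned}
\beta_k^m &= \Pr(R_k^N \mid S_k = m) \\
&= \sum_{m'} \sum_{j=0}^{1} \Pr(d_k = j, S_{k+1} = m', R_k^N \mid S_k = m) \\
&= \sum_{m'} \sum_{j=0}^{1} \Pr(R_{k+1}^N \mid S_k = m, d_k = j, S_{k+1} = m', R_k) \\
&\qquad \times \Pr(d_k = j, S_{k+1} = m', R_k \mid S_k = m) \\
&= \sum_{j=0}^{1} \Pr(R_{k+1}^N \mid S_{k+1} = f(j,m)) \\
&\qquad \times \Pr(d_k = j, S_k = m, R_k) \\
&= \sum_{j=0}^{1} \delta_k^{j,m} \beta_{k+1}^{f(j,m)}
\end{aligned} \tag{17}$$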

Figure 2 gives a graphical illustration of the calculation 
of $\alpha_k^m$ and $\beta_k^m$. It is very similar to the 
architecture of the Viterbi algorithm. Where we add 
the branch metric to the state metric in the Viterbi 
algorithm, we multiply in the MAP algorithm. 
Where we find the minimum of the path metrics in 
the Viterbi algorithm, we add in the MAP algorithm. 
Thus the add-compare-select (ACS) operation in 
the Viterbi algorithm becomes the multiply-add 
(MA) operation in the MAP algorithm. 

The MAP algorithm works by first calculating the 
$\alpha_k^m$'s in the forward direction and storing the results. 
The $\beta_k^m$'s are then calculated in the reverse direction. 
An important observation is that the $\delta_k^{i,m}\beta_{k+1}^{f(i,m)}$ term 
in (17) is also used in the calculation of $\lambda_k$ in (15). 
Thus, while the $\beta_k^m$'s are being calculated, $\lambda_k$ should 
be calculated at the same time, reusing the 
$\delta_k^{i,m}\beta_{k+1}^{f(i,m)}$ terms to minimize the number of computations. 
If the block starts in state zero, then we 
initialize $\alpha_1^0 = 1$ and $\alpha_1^m = 0$ for $m \neq 0$. A similar 
initialization is performed for $\beta_{N+1}^m$ if the block ends 
in state zero. If the block ends in an unknown state 
(which occurs when there is no termination), then 
$\beta_{N+1}^m = 1$ for all $m$. 

Figure 2. Graphical representation of calculation of $\alpha_k^m$ and $\beta_k^m$ 

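The branch metric $\delta_k^{i,m}$ can be determined from 
the transition probability of the discrete memoryless 
channel and the APrP. From (13) and using Bayes' 
rule, we have 

$$\begin{aligned}
\delta_k^{i,m} &= \Pr(d_k = i, S_k = m, R_k) \\
&= \Pr(R_k \mid d_k = i, S_k = m) \\
&\qquad \times \Pr(S_k = m \mid d_k = i)\Pr(d_k = i) \\
&= \Pr(x_k \mid d_k = i, S_k = m) \\
&\qquad \times \Pr(y_k \mid d_k = i, S_k = m)\,\zeta_k^i/2^v
\end{aligned} \tag{18}$$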

since $p_k$ and $q_k$ are independent, the current state is 
independent of the current input and can be in any 
of the $2^v$ states, and $\zeta_k^i = \Pr(d_k = i)$ by definition. 
For an AWGN channel with zero mean and variance 
$\sigma^2$, (18) becomes 

$$\begin{aligned}
\delta_k^{i,m} &= \frac{\zeta_k^i}{2^v\,2\pi\sigma^2} \exp\left(-\frac{(x_k - (2i - 1))^2}{2\sigma^2}\right) dx_k \exp\left(-\frac{(y_k - (2c^{i,m} - 1))^2}{2\sigma^2}\right) dy_k \\
&= \kappa_k \zeta_k^i \exp[L_c(x_k i + y_k c^{i,m})]
\end{aligned} \tag{19}$$

where $\kappa_k$ is a constant, $dx_k$ and $dy_k$ are the differentials 
of $x_k$ and $y_k$, $L_c = 2/\sigma^2$ and $c^{i,m}$ is the coded 
bit given $d_k = i$ and $S_k = m$. Since the constant $\kappa_k$ in 
(19) does not affect $\lambda_k$ in (15), we can normally 
ignore $\kappa_k$. In practice, though, when calculating the 
forward and reverse state metrics, we let $\kappa_k$ be equal 
to the inverse of the largest previous state metric. 
This normalizes the new state metrics and ensures 
that the state metrics do not under- or overflow. 
If we substitute (19) into (15), we obtain 

$$\begin{aligned}
\lambda_k &= \zeta_k \exp(-L_c x_k) \frac{\sum_m \alpha_k^m \exp(L_c y_k c^{0,m}) \beta_{k+1}^{f(0,m)}}{\sum_m \alpha_k^m \exp(L_c y_k c^{1,m}) \beta_{k+1}^{f(1,m)}} \\
&= \zeta_k \exp(-L_c x_k)\,\zeta'_k
\end{aligned} \tag{20}$$

where $\zeta_k = \zeta_k^0/\zeta_k^1$ is the input APrP ratio and $\zeta'_k$ is 
the output extrinsic information. One can think of 
$\zeta'_k$ as a correction term that changes the input 
information so as to minimize the probability of 
decoding error. This extrinsic information is very 
important in turbo decoding as it allows the correction 
terms to be passed from one decoder to the 
next. 



2.2. The log-MAP Decoding Algorithm 

To minimize the decoding complexity, we would 
like to eliminate the multiply operations required by 
the MAP algorithm. This can be achieved by taking 
the logarithm (or negative logarithm) of the algorithm. 
Again, this technique was first used for the 
ISI channel.²⁵,²⁶ It was later applied to the coding 
channel in References 18, 19 and 27-29. 

Taking the negative logarithm, the multiplications 
in the algorithm are converted to additions. Adders 
are much easier to implement than multipliers. However, 
the additions in the algorithm are now replaced by the E operation 
defined below: 

$$a \,\mathrm{E}\, b \triangleq -\log_e(e^{-a} + e^{-b}) = \min(a,b) - \log_e(1 + e^{-|a-b|}) \tag{21}$$

The functions $\min(a,b)$ and $|a-b|$ can be easily 
determined using subtraction and multiplexer circuits. 
However, the function 

$$f(z) = \log_e(1 + e^{-z}) = c\ln(1 + \exp(-z/c)) \tag{22}$$

where $c = 1/\ln e = \log_e \mathrm{e}$, would appear to be too 
complicated to be implemented. (Here $e$ is the base of the 
logarithm, which need not equal Euler's number $\mathrm{e}$; the two 
coincide when $c = 1$.) Figure 3 plots $f(z)$ 
against $z$ for $c = 1$. We can see that $f(z)$ quickly 
decays to zero and has a maximum value of 
$c\ln 2 \approx 0.693c$ (for $z \ge 0$). Thus $f(z)$ can be easily 
implemented in a small look-up table. A range of 
look-up tables can be implemented to cover various 
values of $c$. 

Figure 3. Plot of $f(z)$ versus $z$ for $c = 1$ 
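
The E operation and its correction function can be sketched in a few lines of Python. The sketch below assumes the forms $a\,\mathrm{E}\,b = \min(a,b) - f(|a-b|)$ and $f(z) = c\ln(1 + \exp(-z/c))$; the function names and the table size of eight entries are illustrative choices, not values taken from the decoder hardware.

```python
import math

def f(z: float, c: float = 1.0) -> float:
    """Correction function f(z) = c * ln(1 + exp(-z / c))."""
    return c * math.log1p(math.exp(-z / c))

def e_op(a: float, b: float, c: float = 1.0) -> float:
    """E operation computed as min(a, b) minus the correction term."""
    return min(a, b) - f(abs(a - b), c)

def e_op_direct(a: float, b: float, c: float = 1.0) -> float:
    """Same quantity computed directly from the definition, for comparison."""
    return -c * math.log(math.exp(-a / c) + math.exp(-b / c))

def make_table(c: float, size: int = 8) -> list:
    """Illustrative integer look-up table for f(z), z = 0 .. size - 1."""
    return [round(f(z, c)) for z in range(size)]

print(e_op(3.0, 5.0), e_op_direct(3.0, 5.0))
print(f(0.0), math.log(2.0))   # maximum of f for z >= 0 when c = 1
print(make_table(c=4.0))
```

Because $f(z)$ decays quickly, a table indexed by the integer difference $|a-b|$ needs only a handful of non-zero entries for a given $c$.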
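If we let 

$$A_k^m = -\log_e \alpha_k^m \tag{23a}$$
$$B_k^m = -\log_e \beta_k^m \tag{23b}$$
$$D_k^{i,m} = -\log_e \delta_k^{i,m} \tag{23c}$$
$$L_k = -\log_e \lambda_k \tag{23d}$$

the MAP algorithm becomes 

$$L_k = \mathop{\mathrm{E}}_{m=0}^{2^v-1}\left(A_k^m + D_k^{0,m} + B_{k+1}^{f(0,m)}\right) - \mathop{\mathrm{E}}_{m=0}^{2^v-1}\left(A_k^m + D_k^{1,m} + B_{k+1}^{f(1,m)}\right) \tag{24}$$

$$A_k^m = \mathop{\mathrm{E}}_{j=0}^{1}\left(A_{k-1}^{b(j,m)} + D_{k-1}^{j,b(j,m)}\right) \tag{25}$$

$$B_k^m = \mathop{\mathrm{E}}_{j=0}^{1}\left(D_k^{j,m} + B_{k+1}^{f(j,m)}\right) \tag{26}$$

where $\mathrm{E}_{j=0}^{n-1} a_j = a_0 \,\mathrm{E}\, a_1 \,\mathrm{E} \cdots \mathrm{E}\, a_{n-1}$. The branch metrics 
are 

$$\begin{aligned}
D_k^{i,m} &= -\log_e \kappa_k - \log_e \zeta_k^i - A(x_k i + y_k c^{i,m}) \\
&= -K_k - (z_k + A x_k)i - A y_k c^{i,m}
\end{aligned} \tag{27}$$

where $K_k$ is a constant, $A = (2/\sigma^2)\log_e \mathrm{e} = L_c c$ and 
$z_k = -\log_e \zeta_k$ is the log-APrP. To perform renormalization, 
we let $K_k$ be equal to the smallest previous 
state metric. We can think of $A$ as the no-noise 
amplitude of our demodulated and quantized signal, 
e.g. +1 could be equivalent to $A = 7 = 111_2$. Note 
that we can arbitrarily vary $A$ and $L_c$ to determine 
$c$ and thus the values in a look-up table for (22). 
Alternatively, given $L_c$ and a look-up table for a 
value of $c$, we can vary $A$. We can re-express 
(20) as 

$$L_k = z_k + A x_k + z'_k \tag{28}$$

where $z'_k = -\log_e \zeta'_k$ is the extrinsic information from 
the log-MAP decoder. 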

2.3. The sub-MAP Decoding Algorithm 
If we let $f(z) = 0$, then (24)-(26) become 

$$L_k = \min_m \left(A_k^m + D_k^{0,m} + B_{k+1}^{f(0,m)}\right) - \min_m \left(A_k^m + D_k^{1,m} + B_{k+1}^{f(1,m)}\right) \tag{29}$$

$$A_k^m = \min\left(A_{k-1}^{b(0,m)} + D_{k-1}^{0,b(0,m)},\; A_{k-1}^{b(1,m)} + D_{k-1}^{1,b(1,m)}\right) \tag{30}$$

$$B_k^m = \min\left(D_k^{0,m} + B_{k+1}^{f(0,m)},\; D_k^{1,m} + B_{k+1}^{f(1,m)}\right) \tag{31}$$

This suboptimal algorithm (which we shall call sub-MAP) 
has the advantage that it is independent of 
$\sigma^2$. Again, this algorithm was first derived for the 
ISI channel²⁵,²⁶ and then later applied to the coding 
channel.¹⁹,³⁰ It has the same suboptimal hard-decision 
performance as the Viterbi algorithm. Note that 
the calculation of the forward state metrics is exactly 
the same as the state metric (SM) calculation for 
the Viterbi algorithm. However, unlike the Viterbi 
algorithm, the SMs also need to be calculated in the 
reverse direction and the likelihood ratio determined. 
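
The recursions of this section can be exercised end to end with a small Python sketch. It is a floating-point illustration only, under several assumptions that are ours rather than the paper's: a hypothetical 4-state systematic feedback encoder (feedback $1 + D + D^2$, parity $1 + D^2$), an unterminated block, $z_k = 0$, $K_k = 0$, $c = 1$, and a sign convention in which $L_k > 0$ favours $d_k = 1$. Setting `exact=False` drops the correction $f(z)$ and gives the sub-MAP (min-only) variant.

```python
import math
from functools import reduce

BIG = 1e9  # finite stand-in for an infinite metric (zero probability)

def step(i, m):
    """Assumed 4-state encoder: returns next state f(i, m) and coded bit c^{i,m}."""
    s1, s2 = (m >> 1) & 1, m & 1
    a = i ^ s1 ^ s2              # feedback bit
    return (a << 1) | s1, a ^ s2

def e_op(a, b, exact):
    """E operation; exact=False sets the correction f(z) to zero (sub-MAP)."""
    corr = math.log1p(math.exp(-abs(a - b))) if exact else 0.0
    return min(a, b) - corr

def decode(x, y, amp, exact=True):
    """Return L_k for every bit of the block; L_k > 0 favours d_k = 1."""
    n, states = len(x), range(4)
    e_all = lambda vals: reduce(lambda a, b: e_op(a, b, exact), vals)
    # Branch metrics with z_k = 0 and K_k = 0: D = -A (x_k i + y_k c^{i,m})
    d = [[[-amp * (x[k] * i + y[k] * step(i, m)[1]) for m in states]
          for i in (0, 1)] for k in range(n)]
    # Forward state metrics; the block starts in state zero
    fwd = [[0.0, BIG, BIG, BIG]]
    for k in range(n):
        incoming = [[] for _ in states]
        for m in states:
            for i in (0, 1):
                incoming[step(i, m)[0]].append(fwd[k][m] + d[k][i][m])
        fwd.append([e_all(v) for v in incoming])
    # Reverse state metrics (unknown final state), with L_k in the same pass
    rev, llr = [0.0] * 4, [0.0] * n
    for k in reversed(range(n)):
        term = [[d[k][i][m] + rev[step(i, m)[0]] for m in states] for i in (0, 1)]
        llr[k] = (e_all([fwd[k][m] + term[0][m] for m in states])
                  - e_all([fwd[k][m] + term[1][m] for m in states]))
        rev = [e_op(term[0][m], term[1][m], exact) for m in states]
    return llr

def encode(bits):
    """Map bits to noiseless (x_k, y_k) symbols, starting in state zero."""
    m, x, y = 0, [], []
    for b in bits:
        m, c = step(b, m)
        x.append(2.0 * b - 1.0)
        y.append(2.0 * c - 1.0)
    return x, y

bits = [1, 0, 1, 1, 0, 0, 1, 0]
x, y = encode(bits)
for exact in (True, False):
    decided = [int(v > 0) for v in decode(x, y, amp=4.0, exact=exact)]
    print("log-MAP" if exact else "sub-MAP", decided)
```

Note that the reverse pass reuses the sum of branch metric and reverse state metric (`term`) for both the new reverse metrics and $L_k$, which is the saving in computation pointed out in Section 2.1.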

3. IMPLEMENTING THE LOG-MAP 
ALGORITHM 

As can be seen from (24)-(27), there are four 
major sections in a log-MAP decoder. These are 
the forward state metric calculator (FSMC), the 
reverse state metric calculator (RSMC), the log likelihood 
ratio calculator (LLRC) and the branch metric 
calculator (BMC). To minimize the decoder gate 
count, it was decided to have a serial implementation; 
that is, the SMs are computed one at a time. 
This is similar to previous serial Viterbi and trellis 
decoders constructed by the author.³¹,³² Since there 
are $2^v$ states, this implied that it would take at least 
$2^v$ decoder clock (CLK) cycles to decode each bit. 
Thus the CLK frequency had to be at least $2^v$ times 
greater than the data clock (DCLK) frequency. 
To avoid problems with high-speed clocks, it was 
decided to limit the CLK speed to 10 MHz. Also, 
the Xilinx XC3100A³³ series programmable logic 
was chosen owing to its low cost and relatively 
high speed. 
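
The relation between the CLK limit and the achievable data rate can be checked with a short calculation. The Python sketch below assumes only the lower bound of $2^v$ CLK cycles per decoded bit; the reported speeds are those quoted in the Introduction, and the difference between bound and reported speed is not explained by this sketch.

```python
CLK_HZ = 10e6  # CLK limit stated in the text

# Decoding speeds quoted in the Introduction (kbit/s), keyed by number of states
REPORTED_KBPS = {4: 624.4, 16: 356.8, 512: 17.7}

def max_bit_rate_kbps(states: int, clk_hz: float = CLK_HZ) -> float:
    """Upper bound on the data rate if each bit needs at least `states` CLK cycles."""
    return clk_hz / states / 1e3

for states, reported in REPORTED_KBPS.items():
    bound = max_bit_rate_kbps(states)
    cycles = CLK_HZ / (reported * 1e3)
    print(f"{states:3d} states: bound {bound:8.2f} kbit/s, "
          f"reported {reported:6.1f} kbit/s (about {cycles:.1f} CLK cycles per bit)")
```

In each case the reported speed lies below the bound, as expected for a serial design that needs additional cycles per bit.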

3.1. Branch Metric Calculator 

From Reference 34 the BMs are calculated as follows: 

$$D_k^{i,m} = |z_k + A x_k|\,(i \oplus u(z_k + A x_k)) + |A y_k|\,(c^{i,m} \oplus u(A y_k)) - K_k \tag{32}$$

where $i \oplus j$ is the modulo-2 sum of $i$ and $j$ and 
$u(w)$ is the unit step function. As before, $K_k$ is equal 
to the minimum previous state metric. In two's 
complement notation, $u(w)$ corresponds to the logical 
inverse of the most significant or sign bit of $w$. It 
can be shown that 

$$|w|\,(j \oplus u(w)) = -wj + (w + |w|)/2 \tag{33}$$

The $-wj$ term in (33) corresponds directly to the 
terms in (27) with $w = z_k + A x_k$ and $j = i$ or $w = A y_k$ 
and $j = c^{i,m}$. The additional terms are constants 
dependent on $k$ only and can be absorbed in $K_k$. 
We can see that when the BM in (32) is added to 
its state metric, the resulting value will always be 
greater than or equal to zero. 
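
The identity relating the modulo-2 form of the branch metric to the $-wj$ form can be verified exhaustively for small integers. The Python sketch below assumes 8 bit two's complement values and takes $u(0) = 1$; the helper names are hypothetical.

```python
def unit_step(w: int) -> int:
    """Unit step function, taking u(0) = 1."""
    return 1 if w >= 0 else 0

def unit_step_twos_complement(w: int, bits: int = 8) -> int:
    """Logical inverse of the sign bit of w in `bits`-bit two's complement."""
    sign_bit = ((w & ((1 << bits) - 1)) >> (bits - 1)) & 1
    return sign_bit ^ 1

def bm_term(w: int, j: int) -> int:
    """Non-negative branch metric term |w| (j xor u(w))."""
    return abs(w) * (j ^ unit_step(w))

# Check |w| (j xor u(w)) == -w j + (w + |w|) / 2 over the whole 8 bit range
for w in range(-128, 128):
    assert unit_step(w) == unit_step_twos_complement(w)
    for j in (0, 1):
        assert bm_term(w, j) == -w * j + (w + abs(w)) // 2
        assert bm_term(w, j) >= 0
print("identity holds for all 8 bit values of w")
```

The term is zero when the hypothesised bit $j$ agrees with the sign of $w$ and $|w|$ otherwise, which is why the resulting metrics never go negative.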

An important consideration is the value of the 
signal amplitude $A$. The optimum value of $A$ can 
vary depending on the number of quantization bits 
$q$ and the noise variance $\sigma^2$.³⁵ Figure 4 illustrates a 
model of the demodulator for the received signal $x_k$ 
(as defined in (3)). A model of the branch metric 
calculator is also shown (we have assumed that 
$z_k = 0$ in this case). We assume that $x_k$ is multiplied 
by some unknown fixed positive voltage $V$. The 
demodulator has an automatic gain control (AGC) 
circuit that effectively normalizes the input signal 
to its mean absolute value, i.e. 

$$E[|V x_k|] = V E[|2d_k - 1 + p_k|] = V\,\mathrm{mag}(\sigma) \tag{34}$$

Note that for high SNR (and low $\sigma$), $\mathrm{mag}(\sigma) \approx 1$. 
However, for low SNR (and high $\sigma$) we have 
$\mathrm{mag}(\sigma) \approx \sigma\sqrt{2/\pi} \approx 0.798\sigma$. This is very important 
when trying to estimate the noise variance. A variance 
estimator that assumes $\mathrm{mag}(\sigma) = 1$ will work 
correctly only for high SNR. 

For turbo codes where a low SNR is expected, a 
more complicated method is required to determine 
$V$ and $\sigma$. We can see from (34) that we have two 
unknowns and one equation. To solve for $V$ and $\sigma$, 
we need another equation which can be obtained by 
estimating the square of the received signal: 

$$E[(V x_k)^2] = V^2(1 + \sigma^2) = V^2\,\mathrm{rms}^2(\sigma) \tag{35}$$

Figure 5 plots $\mathrm{mag}(\sigma)$ and $\mathrm{rms}(\sigma) = \sqrt{1 + \sigma^2}$. We 
can see that for low SNR, $\mathrm{rms}(\sigma) \approx \sigma$. Owing to 
the complexity of (34), there does not seem to be 
a simple direct solution of the two equations. The 
estimation of (34) and (35) can be performed digitally. 
By quantizing (34) and (35) into, say, 8 bits 
each, a 64K x 8 look-up table can be used to output 
precomputed $\sigma$ values quantized to 8 bits each. 
Alternatively, the ratio of the square of (34) with 
(35) will give a single equation as a function of 
$\sigma$.³⁶ A smaller 256 x 8 look-up table can then be 
used to output $\sigma$. 
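
The ratio method can be illustrated numerically. The Python sketch below assumes that $\mathrm{mag}(\sigma)$ is the mean of a folded normal variable, $\sigma\sqrt{2/\pi}\exp(-1/(2\sigma^2)) + \mathrm{erf}(1/(\sigma\sqrt{2}))$, and that $\mathrm{rms}(\sigma) = \sqrt{1 + \sigma^2}$; it further assumes that the ratio $\mathrm{mag}^2/\mathrm{rms}^2$ decreases monotonically over the search range, so that bisection can stand in for the look-up table. The function names and the search interval are illustrative.

```python
import math

def mag(sigma: float) -> float:
    """Mean absolute value of 1 + p with p ~ N(0, sigma^2) (folded normal mean)."""
    return (sigma * math.sqrt(2.0 / math.pi) * math.exp(-1.0 / (2.0 * sigma ** 2))
            + math.erf(1.0 / (sigma * math.sqrt(2.0))))

def rms(sigma: float) -> float:
    """Root mean square of 1 + p with p ~ N(0, sigma^2)."""
    return math.sqrt(1.0 + sigma ** 2)

def ratio(sigma: float) -> float:
    """Square of the absolute mean divided by the mean square; independent of V."""
    return (mag(sigma) / rms(sigma)) ** 2

def estimate(abs_mean: float, mean_square: float, lo: float = 0.05, hi: float = 10.0):
    """Recover (V, sigma) from the two measured averages by bisection on the ratio."""
    target = abs_mean ** 2 / mean_square
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if ratio(mid) > target:   # ratio too large means sigma is too small
            lo = mid
        else:
            hi = mid
    sigma = 0.5 * (lo + hi)
    return abs_mean / mag(sigma), sigma

# Synthetic measurement for V = 3 and sigma = 1
v_true, s_true = 3.0, 1.0
print(estimate(v_true * mag(s_true), (v_true * rms(s_true)) ** 2))
print(mag(0.1), mag(100.0) / 100.0)   # high-SNR and low-SNR limits
```

A hardware table would tabulate the same inverse mapping from the quantized ratio to $\sigma$ instead of iterating.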

The value of $A$ compared with the dynamic range 
of an analogue-to-digital (A/D) converter can greatly 
affect the decoded BER³⁵ (especially when the 
number of quantization levels is small). We shall 
assume that there are $2^q - 1$ quantization regions 
with a central 'dead zone'; that is, a quantized '0' 
ranges from -0.5 to 0.5. The largest quantized value 
is $2^{q-1} - 1$ and ranges from $2^{q-1} - 1.5$ to infinity. As 
shown in our demodulator model in Figure 4, we 
shall assume that the demodulator scales the input 
by $C$ before A/D conversion. In Reference 35, 
computer simulations of the Viterbi algorithm 
showed that $C$ should be less than one. Also, as 
the SNR is decreased, the value of $A$ relative to the 
maximum quantized output should decrease as well. 
From Figure 4 we have that the 'optimum' value of 
$A$ is 

$$A = \frac{C(2^{q-1} - 1)}{\mathrm{mag}(\sigma)} \tag{36}$$

Analysing the simulations in Reference 35, we found 
that $C = 0.65$ should give near-optimum performance. 
However, for low $\sigma$ it may be necessary to 
reduce $C$ (and thus $A$) to keep the look-up tables 
for the E operand at a reasonable size. 

Figure 4. Demodulator and branch metric model (panels: AGC Model, A/D Model, Branch Metric Model) 

Figure 5. Absolute mean and root mean square of received signal 
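
As a worked illustration of the amplitude rule and the quantizer model, the Python sketch below assumes $A = C(2^{q-1} - 1)/\mathrm{mag}(\sigma)$ and a symmetric quantizer with a dead zone around zero. The rounding rule used at region boundaries is our assumption, and the function names are hypothetical.

```python
import math

def optimum_amplitude(q: int, mag_sigma: float, c: float = 0.65) -> float:
    """Signal amplitude A for q-bit quantization, given the value of mag(sigma)."""
    return c * (2 ** (q - 1) - 1) / mag_sigma

def quantize(value: float, q: int) -> int:
    """Symmetric quantizer with 2**q - 1 regions; '0' covers -0.5 to 0.5."""
    limit = 2 ** (q - 1) - 1
    level = math.floor(value + 0.5)         # nearest integer level
    return max(-limit, min(limit, level))   # outer regions extend to infinity

q = 6
print(optimum_amplitude(q, mag_sigma=1.0))   # high-SNR case, mag(sigma) close to 1
print([quantize(v, q) for v in (-100.0, -0.4, 0.49, 29.6, 30.5, 1000.0)])
```

With $q = 6$ the largest level is 31, so a scaling of $C = 0.65$ places the no-noise amplitude at roughly two thirds of full scale when $\mathrm{mag}(\sigma)$ is close to one.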

For our BMC it was decided to have $q = 6$ bit 
quantization as a compromise between having good 
performance and decoder complexity. Each MAP 
decoder could be programmed for either rate 1/2, 
1/3 or 1/4 operation. For rate 1/4 mode and $z_k = 0$ 
this implied that the maximum BM value is 
$n(2^{q-1} - 1) = 124$, where $n$ is the number of coded 
bits. In turbo decoding mode, $z_k$ can range from 
-128 to +127, which can greatly increase the 
maximum BM. Owing to the limitation of the number 
of bits to represent each SM, the maximum BM 
was therefore limited to 127. 

One BM is calculated each CLK cycle with a 
two-CLK-cycle pipeline delay (one cycle to determine 
the symbol and another cycle to perform the 
calculation). The BM is then passed onto the FSMC 
and an SRAM for storage. The BMs stored in the 
SRAM are then read out in reverse order for the 
RSMC. Since $2^v$ BMs are calculated for $N$ DCLK 
cycles, the total storage space required is $N 2^v$. 
Although there are only $2^n$ BMs, we chose to simply 
store and then later retrieve the previously calculated 
BMs for the RSMC. 

Ideally, for the $N = 2^{16}$ turbo code in Reference 
6 the required memory storage is one megabyte 
(MB). This was too large and too expensive to 
implement and so it was decided to limit the storage 
space to 64K. For $v = 4$ this implies that 
$N = 2^{12} = 4096$. A simplistic storage technique is to 
write the new data into one 64K RAM and read 
the old data from another 64K RAM. This requires 
a total of 128K of RAM. We can halve the amount 
of RAM by using the circuit shown in Figure 6. 

When CLK is high, the previously stored BM is 
read out from the RAM and then latched on the 
falling edge of CLK. Simultaneously, a new BM is 
written into the RAM using the same address when 
CLK goes low. Thus, in one clock cycle we perform 
a read followed by a write. Since the RAM address 
is inverted every N DCLK cycles, the BMs are read 
out in reverse order. A control signal to the CE
input of the RAM is used to enable the storage of
the new SMs at the correct time. In our design, two 
35 ns 64Kx4 separate I/O SRAMs were used. 
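
The read-then-write storage scheme described above can be modelled in a few lines of software. The following is a minimal sketch, not the authors' circuit: the function name `reversing_ram` and the use of `None` for locations that have not yet been written are assumptions made for illustration. Address inversion is modelled as `n - 1 - offset`, which equals a bitwise inversion of the address when n is a power of two.

```python
def reversing_ram(samples, n):
    """Model one RAM of n words that is read, then written, at the same address.

    The address sequence is inverted every n cycles, so each block of n
    samples comes back out in reverse order, one block later.
    """
    ram = [None] * n
    out = []
    for t, value in enumerate(samples):
        block, offset = divmod(t, n)
        # Even blocks count up, odd blocks count down (inverted address).
        addr = offset if block % 2 == 0 else n - 1 - offset
        out.append(ram[addr])   # read the old value first (CLK high)
        ram[addr] = value       # then overwrite it with the new value (CLK low)
    return out


if __name__ == "__main__":
    # The first block of output is empty; later blocks are time-reversed inputs.
    print(reversing_ram(list(range(12)), 4))
```

Because every location is read just before it is overwritten, a single RAM of n words is enough, which is the halving of memory claimed in the text.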

Our design could be programmed from four to 512 states, with a corresponding change in N from 16,384 to 128. However, to limit the decoding delay and storage requirements for the turbo decoder, the maximum block size for four and eight states was reduced to 4096.



3.2. State Metric Calculators

The architectures of the FSMC and RSMC are very similar. A 35 ns 1K x 8 dual-port RAM is used to retrieve the old SMs and store the new SMs. With 8 bit precision the SMs can range from zero to 255. A further increase in precision would have greatly increased the complexity of the decoder. Thus it was decided that the SMs would be represented by 8 bits.

At the end of the previous block the initial SMs are stored into one side of the RAM. For the FSMs the SM for state zero is set to zero and the other states are set to 255 (the closest value to infinity). This corresponds to the sequence starting in state zero. For the RSMs, if the final state is unknown, all the initial SMs are set to zero; otherwise they are initialized in the same way as the FSMs.

To determine the read and write addresses for the SMs, we need to examine the implementation of a rate 1/n systematic encoder. Berrou et al. (Reference 6) showed that a rate 1/2 systematic encoder can be implemented using a shift register as shown in Figure 7. We have that $S_k = (s_k^0, s_k^1, \ldots, s_k^{\nu-1})$ corresponds to the current encoder state, $G_i = (g_i^0, g_i^1, \ldots, g_i^{\nu})$ corresponds to the encoder code and $g_i^j \in \{0, 1\}$ for $0 \le i \le n-1$. We assume that $g_i^0 = g_i^{\nu} = 1$ for $0 \le i \le n-1$, since this ensures that the free Hamming distance leaving and entering a state is at least 2n.

For the forward SMs we need to determine
A) from (25). If we let

*(*-•, A) = A., = (sU iJT'rf) (37)

then

s* = (4-i e e . ^r l ) (38)

where

st - 2 8&i mod 2 (39)

Figure 7. Rate 1/2 systematic convolutional encoder

Figure 8. Subtrellis for $\nu = 4$, rate 1/n systematic convolutional code

We can see that by reading two SMs we can generate two new SMs. Figure 8 gives a partial trellis for a $\nu = 4$ code. Thus in the next data clock (DCLK) cycle the previously stored SMs are read out one at a time using the following forward read address (ignoring the subscript k):

Rt = (s\s 2 sr l J>) (40) 

A serial-to-parallel operation is performed and the two SMs are stored in a register for two CLK cycles. In the first CLK cycle, BM$^0$ is added to the first SM and BM$^1$ is added to the second SM to form the first new SM. In the next CLK cycle the BMs are reversed to form the second new SM. After a delay from reading and calculating the new SMs (equal to five CLK cycles), the new SMs are written into the dual-port RAM with the following address (ignoring the subscript k):

W ( =(s°Bs*^ t s 2 (41) 

From (26) we need to determine $f(d_k, S_k)$ for the
reverse direction. If we let

MA) = S*h = © *2 © vW,. - -*vr 1 > 

(42) 

then 

A=(siA.-.rf- , ^S) (43) 

Thus in a similar way to the FSMC the read and 
write addresses for the RSMC are 

^ r «(j°e*VA.y H ). (44) 

W r = (s l ^..X"^°) (45) 

Since there is a delay in calculating the new SMs, we cannot use the read/write technique as used for storing the BMs. With a 1K x 8 DP-RAM this implied the maximum number of states is 512 (half the memory size). The minimum number of states is four owing to the restriction that $g_i^0 = g_i^{\nu} = 1$.

Since the forward SMs are used in the LLRC, they also need to be stored and read out in reverse order. A circuit very similar to the BM storage circuit is used to perform this task. The old forward SMs that are read from the DP-RAM are the values that are stored in two 64K x 4 SRAMs.

An important part of the SM calculator is the implementation of the adders and the E operand. Figure 9 illustrates how the BMs are added to the SMs (similar to that in Reference 37). We see that the previous minimum SM is subtracted from the BMs before being added to the SMs. The BM subtraction circuit has a two's complement output (we limit the output so that the range is from -128 to +127). The BMs are then added to the SMs (just as in the Viterbi algorithm). If the BM is positive, the SM limiting circuit is enabled. However, if the BM is negative, the limiting circuit is disabled to allow normal addition to occur. Since the BM can never be more negative than the smallest SM, the resulting SM will always be positive.

Figure 10 illustrates the E operand circuit. As 
shown in (21), we need to find the minimum of 
the two path metrics (PMs) as well as the absolute 
difference. The carry-out of a subtraction circuit 
(with carry-in set to zero) is used to select the 
smallest path metric. This carry-out is also used to 
invert the output of another subtraction circuit to 
give the absolute difference. When $\mathrm{PM}^0$ is greater than $\mathrm{PM}^1$, the output of the subtractor will give the correct positive output when the carry-in is equal to one. However, if $\mathrm{PM}^0$ is less than or equal to $\mathrm{PM}^1$, then the output will be negative. Normally, we would have to invert the output and then add one. However, by setting the carry-in to zero, we avoid the additional adder circuit. This is why the carry-out of the first subtractor goes into the carry-in of the second subtractor.
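
The carry trick described above can be checked with a small bit-level model. This is a sketch under stated assumptions rather than the circuit of Figure 10: subtraction is modelled as `a + ~b + carry_in` on 8 bit words, and the function name `abs_min` is hypothetical.

```python
MASK = 0xFF  # 8 bit path metrics


def abs_min(pm0, pm1):
    """Return (min(pm0, pm1), |pm0 - pm1|) using two subtractors and no extra adder."""
    inv = ~pm1 & MASK
    # First subtractor, carry-in = 0: the carry-out is 1 only when pm0 > pm1.
    carry = (pm0 + inv) >> 8
    # Second subtractor, carry-in = carry-out of the first.
    diff = (pm0 + inv + carry) & MASK
    if not carry:
        # pm0 <= pm1: diff holds pm0 - pm1 - 1, so inverting it gives pm1 - pm0.
        diff = ~diff & MASK
    minimum = pm1 if carry else pm0
    return minimum, diff


if __name__ == "__main__":
    print(abs_min(5, 3))   # smaller metric and absolute difference
    print(abs_min(3, 5))
```

When $\mathrm{PM}^0 \le \mathrm{PM}^1$ the second subtractor yields one less than the true (negative) difference, and a plain bitwise inversion of that value is exactly the absolute difference, so the usual "invert and add one" step is not needed.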

A circuit that uses a comparator, two multiplexers and a subtractor can also be used to implement the minimum and absolute difference functions (Reference 29). In XC3100A logic this would require at least 20 configurable logic blocks (CLBs) to implement using 8 bit arithmetic, compared with 16 CLBs for our design (since the XNORs can be absorbed into the second comparator).

In order to keep the SMs positive, we add $c \ln 2$ to (21). Thus, if the absolute difference is equal to zero, we add zero to the minimum; otherwise we add some small value. As determined previously, the maximum value of $f(z)$ is $c \ln 2$. For a rate 1/7 code, $E_b/N_0 = -0.5$ dB and 6 bit quantization we have $\sigma^2 = 3.93$ and, using (36), A = 11.3 (which we round to 11). Thus $c = \sigma^2 A/2 = 21.6$ and the maximum value of $f(z)$ is 14.99 (15 after rounding). Thus only 4 bits are required to represent $f(z)$.

The smallest non-zero value of $f(z)$ that results in zero after quantization is 0.5. Solving for $f(z) = 0.5$, we have

$z = -c \ln[\exp(0.5/c) - 1]$    (46)

For the above conditions we have z = 81.2. Since $f(z)$ decreases monotonically with z, all values of $f(z)$ for $z \ge 81.2$ will be less than 0.5 and so will be quantized to zero. Therefore, by limiting values above 81 (the quantized value of 81.2) from the absolute difference circuit to 81, we only require an 81 x 4 look-up table to implement $f(z)$. In our design we limited the maximum address space to 63 in order to reduce the design complexity. This implied from (46) that the largest value of c is 17.835 (giving z = 63.5). Thus, if the optimum value of A causes c to be more than 17.835, we reduce A to give us the maximum c. For the above example this would be A = 9.08 (or A = 9 after rounding).
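
The sizing argument above can be reproduced numerically. The sketch below assumes that $f(z) = c \ln(1 + e^{-z/c})$, which is the form implied by (46) and by the stated maximum of $c \ln 2$; the function names and the rounding rule in `build_lut` are illustrative assumptions, not taken from the design.

```python
import math


def f(z, c):
    """Correction term f(z) = c*ln(1 + exp(-z/c)); equals c*ln(2) at z = 0."""
    return c * math.log1p(math.exp(-z / c))


def z_cutoff(c):
    """Equation (46): the z at which f(z) has fallen to 0.5."""
    return -c * math.log(math.exp(0.5 / c) - 1.0)


def max_amplitude(c_max, sigma2):
    """Invert c = sigma^2 * A / 2 to get the largest A permitted by c_max."""
    return 2.0 * c_max / sigma2


def build_lut(c, size=64):
    """Rounded table of c*ln(2) - f(z), the non-negative value added to the minimum."""
    offset = c * math.log(2.0)
    return [int(offset - f(z, c) + 0.5) for z in range(size)]


if __name__ == "__main__":
    sigma2 = 3.93
    print(z_cutoff(sigma2 * 11 / 2))        # table length needed for A = 11
    print(z_cutoff(17.835))                 # table length for the capped c
    print(max_amplitude(17.835, sigma2))    # reduced amplitude
    print(max(build_lut(17.835)))           # largest table entry
```

With c capped at 17.835 every table entry fits in 4 bits and the table needs only 64 addresses, which matches the 64 x 4 look-up table used in the design.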

Figure 10 shows how the absolute difference output is limited to 6 bits and used to address a 64 x 4 look-up table to determine $-f(z) + c \ln 2$. The table look-up output is then added to the minimum circuit output. The adder output is limited to 8 bits as shown. Just as in a Viterbi decoder, limiting the outputs of the adders will cause some degradation in performance. However, since it is the larger and thus less likely path metrics that are limited, this degradation will be very small.

3.3. Log Likelihood Ratio Calculator

For the computation of \£ (EM') we add the 
reverse path metric for bit $i$ (RPM$^i$) to the corresponding forward SM that is read from the SRAM. This is shown in Figure 11. The summation gives a 9 bit result which is not limited. A most significant sign bit is also added owing to the effect of $f(z)$ (in this case we cannot add $c \ln 2$ because we are E summing more than one term). The register is initialized to its maximum value (+511) to start the E summation (or 'eccumulation'). We could have used a multiplexer to initialize the EM$^i$ register to the first summation. However, to reduce complexity, a simple limiting circuit was used. Since on the first E summation +511 is usually much greater than the first metric, the EM$^i$ register would be correctly initialized most of the time.

We then find the minimum and absolute difference between the current register output and the current metric (the summation of RPM$^i$ and SM). We then perform the standard look-up operation, in this case subtracting the look-up result from the minimum. To decrease the delay between the register output and its input, we also register the outputs of the abs-min circuit. This forced us to insert a multiplexer after the first adder (the other input was the EM$^i$ output) to allow the correct EM$^i$ to be calculated owing to pipeline delay.

After the EMs have been finally calculated (which is done in parallel), we subtract EM$^1$ from EM$^0$ and limit the output to +127 or -128 if necessary to give an 8 bit result. An additional three clock cycles are required to perform the final calculation. Since the LLRs are calculated with the reverse SMs, the LLRs are produced in reverse order. Thus for a normal MAP decoder the output needs to be reversed in time in blocks of N.

3.4. Decoder Performance

Since up to two clock cycles are required for the decoder to start, two for the BMC, five for the SMCs and three for the LLRC, a total of $2^{\nu} + 12$ clock cycles are required. All the logic was implemented in XC3100A-5 gate arrays, which allowed a clock speed of $f_c = 10$ MHz. The SRAMs were 35 ns in speed. Also, $\nu$ bits were used to terminate the trellis, which slightly reduces the speed of the decoder. The decoder speed is given by

$\text{decoder speed} = f_c \, \dfrac{2^{16-\nu} - \nu}{2^{16-\nu}\,(2^{\nu} + 12)}, \quad 5 \le \nu \le 9$    (47)



The slowest decoder speed is for $\nu = 9$ at 17.7 kbit/s and the fastest decoder speed is for $\nu = 2$ at 624.7 kbit/s. For $\nu = 4$ the speed is 356.8 kbit/s.
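
The quoted speeds can be reproduced with a short calculation. This sketch assumes that each decoded bit costs $2^{\nu} + 12$ clock cycles, that $\nu$ bits of each block of N are tail bits, and that N is $2^{16-\nu}$ but capped at 4096 for four and eight states (as stated in Section 3.1). The function names are hypothetical.

```python
def block_size(nu):
    """Block size N: 2^(16 - nu), capped at 4096 (the limit for four and eight states)."""
    return min(2 ** (16 - nu), 4096)


def decoder_speed(nu, f_clk=10e6):
    """Information bits per second for memory nu at clock frequency f_clk."""
    n = block_size(nu)
    cycles_per_bit = 2 ** nu + 12
    return f_clk * (n - nu) / (n * cycles_per_bit)


if __name__ == "__main__":
    for nu in (2, 4, 9):
        print(nu, round(decoder_speed(nu) / 1e3, 1), "kbit/s")
```

Under these assumptions the three speeds quoted in the text for $\nu = 2$, 4 and 9 are recovered to the stated precision.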

The SMCs and LLRC were implemented in one XC3190A-5 (7500-gate equivalent). The BMC was implemented in one XC3142A-5 (3700-gate equivalent), while two XC3130A-5s (2700-gate equivalent) were used to implement the control logic and address generation (as well as some additional functions for the turbo decoder). The encoder was implemented using an XC3142A-5. Both the encoder and decoder can be programmed with any code with $g_i^0 = g_i^{\nu} = 1$ through a series of DIP switches.

To test the decoder performance, AWGN was generated on a PC using a C++ program. The parallel port on the back of the PC was used to transmit 8 bit quantized noise samples to the decoder. An adder circuit in the encoder Xilinx chip adds the noise to the encoded signal. The adder circuit produces a 6 bit quantized result with a dead zone and 63 quantization regions. Various noise generators were used, starting with one that had a period of $8.4 \times 10^{6}$ for the MAP decoder tests, $2.1 \times 10^{9}$ for the rate 1/3 turbo decoder tests and $2.3 \times 10^{18}$ for the rate 1/7 turbo decoder tests. This last generator used the 'standard' Lehmer uniform random number generator with multiplier 16807 and modulus $2^{31} - 1$ (Reference 38), together with the Box-Muller algorithm from Reference 39 (pp. 216-217), to generate $2^{23}$ 8 bit quantized random numbers which are stored in the RAM (after inputting A and $\sigma$). The dual 32 bit speed uniform random number generator from Reference 40 was then used to randomly select the numbers from the RAM and send them to the encoder.
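
The noise source described above combines two well-known building blocks, which the following sketch illustrates. It is not the original C++ program: the basic trigonometric form of Box-Muller is assumed (Reference 39 may use a different variant), and the `quantize8` scaling rule is a hypothetical stand-in for the dependence on A and $\sigma$.

```python
import math

MODULUS = 2 ** 31 - 1   # Lehmer modulus
MULTIPLIER = 16807      # Lehmer multiplier


def lehmer(seed):
    """Yield successive states of x <- 16807 * x mod (2^31 - 1)."""
    x = seed
    while True:
        x = (MULTIPLIER * x) % MODULUS
        yield x


def gaussian_pairs(seed=1):
    """Box-Muller: map pairs of uniforms in (0, 1) to pairs of N(0, 1) samples."""
    gen = lehmer(seed)
    while True:
        u1 = next(gen) / MODULUS
        u2 = next(gen) / MODULUS
        r = math.sqrt(-2.0 * math.log(u1))
        yield r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)


def quantize8(x, scale):
    """Scale and clip to a signed 8 bit integer (the scaling rule is an assumption)."""
    return max(-128, min(127, int(round(x * scale))))


if __name__ == "__main__":
    pairs = gaussian_pairs(seed=1)
    print([quantize8(v, 16.0) for v in next(pairs)])
```

A generator state is never zero for a non-zero seed, so the uniforms stay strictly inside (0, 1) and the logarithm is always defined.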

Figure 12 gives the performance of a rate 1/4, 512-state systematic code with code polynomials $g_0 = 1753$, $g_1 = 1547$, $g_2 = 1345$ and $g_3 = 1151$ taken from Reference 41. The performance of the hardware log-MAP decoder is plotted, along with the performance of the hardware sub-MAP decoder and a software block Viterbi decoder. The signal amplitude was fixed to A = 7 to avoid limiting the SMs too much. This is less than the 'optimum' values for 6 bit quantization, but with only 8 bit state metrics, SM limiting becomes more of a factor. At low BER we can see that the implementation loss is only 0.05 dB. Note that at low BER the ideal performances of the Viterbi and MAP algorithms are almost identical, which allows us to make a comparison. At high BER the MAP decoder gains about 0.5 dB over ideal Viterbi.

Also plotted on the graph is the sub-MAP decoder performance. This is where the E operand is simplified to the min function. This was done by setting the look-up tables in the MAP decoder to zero. The signal amplitude was also set to A = 7. As can be seen, at low BER the sub-MAP performance is almost identical to that of the MAP decoder. However, the sub-MAP decoder always performed worse than the ideal Viterbi, losing about 0.3 dB at high BER.

Figure 13 shows the performance for a systematic rate 1/2, 16-state code with polynomials $g_0 = 37$ and $g_1 = 21$ from Reference 6. Since the rate loss from the tail bits is very small, we could compare our decoder results with a software MAP decoder from Reference 42. The hardware MAP decoder closely follows the computer simulation at high BER, losing 0.08 dB at low BER. The sub-MAP decoder loses about 0.5 dB at high BER and closely follows the MAP decoder at low BER.

4. TURBO CODING AND DECODING 

The basic turbo encoder consists of two or more parallel concatenated systematic convolutional encoders separated by an interleaver of size $N_I$. Figure 14 shows the general implementation of a rate 1/3 turbo encoder from two constituent rate 1/2 encoders. The two coded outputs can be punctured in order to obtain higher rates. The INT block is the interleaver and the DEL block is a delay circuit with a delay equal to the MAP decoder delay. The DEL circuit is required since the decoder for the interleaved data has to wait for the first decoder to output its data. Thus the received data have to be appropriately delayed. Instead of the decoder delaying the received interleaved data, this delay is performed within the encoder, simplifying the decoder operation.

Figure 12. Systematic rate 1/4, 512-state BER performance
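
The rate 1/3 structure described above (one systematic stream plus one parity stream from each of two recursive systematic encoders) can be sketched as follows. This is a minimal illustration under several assumptions: the octal polynomials 37 (feedback) and 21 (feedforward) from Section 3.4 are used as an example, the most significant bit of each polynomial is taken as the tap on the shift register input, and the tail bits, puncturing and the DEL block are all omitted. The function names are hypothetical.

```python
def taps(poly, nu):
    """Split a polynomial into nu + 1 tap bits, most significant bit first."""
    return [(poly >> (nu - j)) & 1 for j in range(nu + 1)]


def rsc_parity(bits, g_fb, g_ff, nu):
    """Parity stream of a recursive systematic encoder starting in state zero."""
    fb, ff = taps(g_fb, nu), taps(g_ff, nu)
    state = [0] * nu
    parity = []
    for d in bits:
        a = d                              # shift register input with feedback
        for j in range(nu):
            a ^= fb[j + 1] & state[j]
        p = ff[0] & a                      # feedforward parity bit
        for j in range(nu):
            p ^= ff[j + 1] & state[j]
        parity.append(p)
        state = [a] + state[:-1]           # shift the register by one place
    return parity


def turbo_encode(bits, perm, g_fb=0o37, g_ff=0o21, nu=4):
    """Return (systematic, parity 1, parity 2); perm is the interleaver permutation."""
    interleaved = [bits[i] for i in perm]
    return (list(bits),
            rsc_parity(bits, g_fb, g_ff, nu),
            rsc_parity(interleaved, g_fb, g_ff, nu))


if __name__ == "__main__":
    x, y1, y2 = turbo_encode([1, 0, 1, 1], perm=[2, 0, 3, 1])
    print(x, y1, y2)
```

Only the second parity stream sees the interleaved data; the systematic bits are sent once, which is what gives three coded bits per information bit.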

Figure 15 illustrates the basic decoding block in an iterative turbo decoder. In the first iteration we do not need to add the APrP for $d_k$ to $A x_k$. Since $d_k = 0$ or 1 is equally likely, we have that $z_k^2 = 0$ (a superscript 2 is used here as explained later) is added to $A x_k$. We then decode the symbols from the first encoder. The output from the first MAP decoder is $\lambda_j^1 = A x_j + z_j^1$, where the superscripts indicate which MAP decoder was used and $j = k - D_d$, where $D_d$ is the delay of the MAP decoder. These data are then interleaved to match the interleaved symbols from the second encoder.

The $A x_j + z_j^1$ from the first MAP decoder is then fed into the second MAP decoder. In this case we have let the extrinsic information from the first MAP decoder become the APrP for the second MAP decoder. One can think of the first MAP decoder improving the SNR for $A x_j$, which effectively results in a lower BER for $d_k$. The purpose of the interleaver is to randomize the 'burst' errors that are characteristic of MAP and Viterbi decoders. The larger the interleaver, the more randomized are the bursts of errors.

With an improved $A x_i$ the second MAP decoder is able to correct even more errors. Its output is

$\lambda_i^2 = z_i^1 + A x_i + z_i^2$    (48)

where $z_i^2$ is the extrinsic information from the second MAP decoder and the subscript i is used to indicate the interleaved and delayed time index. We subtract $z_i^1 + A x_i$ from $\lambda_i^2$ to obtain $z_i^2$, which is then deinterleaved and passed onto another iteration as $z_{k-\Delta}^2$ (where $\Delta$ is the total delay of the iteration). Just like in the second MAP decoder, this extrinsic information becomes the APrP for the first MAP decoder in the next iteration. The deinterleaver serves to randomize the burst errors from the second MAP decoder. Note that if we include $z_i^1$ with $z_i^2$, the deinterleaver would produce a delayed $z^1$ with 'bursty' errors. Thus feeding these bursty errors into the first MAP decoder will result in very poor performance from the decoder.

In the next iteration the first MAP decoder output is

$\lambda_j^1 = z_j^2 + A x_j + z_j^1$    (49)

In this case we subtract $z_j^2$ from $\lambda_j^1$ to give the desired $A x_j + z_j^1$, which is then fed into the second MAP decoder as before. Again, we do not want to include $z_j^2$, since after interleaving it would be bursty again.

Figure 15. Iterative turbo decoder

Obviously, in the first stage of decoding we want to obtain the lowest BER possible. However, at low SNRs a code's performance may be opposite to what is expected. An important observation is that the more powerful a code is at low BERs, the worse it performs at low SNR and high BERs (Reference 42). Thus we do not want to choose too complex a code, as iterative decoding will give poor performance. However, we also do not want to choose too weak a code (in terms of performance at low BER), as the latter stages of decoding will not be powerful enough to correct any further errors. This was first noticed in Reference 6, where a 16-state code was found to be optimal for a rate 1/2 turbo decoder.

4.1. Decoder implementation 

The additional circuits required for a turbo decoder are the delays for $z^2$ and $A x + z^1$, the interleavers and deinterleavers, the delay for $A x_k, A y_k^1, \ldots, A y_k^{n-1}$, an adder and two subtractors. For the last stage of decoding the deinterleaver can be used to deinterleave the second MAP decoder output with the addition of a multiplexer. With a rate 1/4 MAP decoder a rate 1/7 turbo decoder could be constructed. To reduce the number of inputs, the data were received serially, one symbol at a time. Thus BPSK modulation could be directly used, although with some modifications QPSK could also be used.

Each MAP decoder has a maximum delay of 4096 bits and, as shown later, each interleaver has a maximum delay of 65,536 bits. Thus the total delay for the seven symbols is 7 x 2 x (4096 + 65,536) = 974,848. Thus six 1M x 1 SRAMs were used to delay the received data. The MAP decoder was initially designed to have a 16K block size and thus a 16K delay (for $\nu = 2$). With this extra delay, six 256K x 1 SRAMs were included in the design, but are no longer necessary. Also, two 16K x 4 separate I/O SRAMs were used for each of the $z^2$ and $A x + z^1$ delays, although two 4K x 4 SRAMs would be sufficient.

Since the interleaver and deinterleaver architectures are similar, we will only discuss the implementation of the interleaver. For the interleaver we have used the same read/write technique as used by the delay and reversing circuits elsewhere in the design (thus giving a delay equal to the interleaver block size). The maximum interleaver size is the same as in Reference 6, i.e. $N_I = 64$K. Thus only two 64K x 4 separate I/O SRAMs are required to implement the interleaver. However, the interleaver address generator circuit is a little more complex and is shown in Figure 16. At start-up the SEL line goes high and the counter output is stored into the SRAM. The SRAM output is then read out and latched by the D-FFs. Using this address, the data are sequentially stored into the interleaver SRAM. At the same time, SEL goes low and the interleaver address is read from the EPROM and stored into the SRAM, to be read out in the next interleaver block. The process repeats, interleaving the previous set of addresses.

Figure 16. Interleaver address generator

Note that only one interleaver address generator (IAG) is implemented. Thus only one set of EPROMs needs to be programmed with any interleaver that is desired. The time reversal of the MAP decoder output can also be incorporated into the interleaver EPROM, reducing the delay and complexity of the decoder. Owing to the two MAP decoder delays (equal to 8K), the interleaver address between each iteration also has to be delayed. We used two 8K x 8 SRAMs to perform this task. Since 8K x 8 SRAMs have common I/O, the circuit in Figure 17 was used to separate the I/O and allow the read/write technique to be used.

A disadvantage of the above scheme is that if an error occurs in the address generator, error propagation occurs and the decoder will have to be reset. However, in the many days of testing we performed on our codec, the decoder never had to be reset owing to the interleaver failing or any other part of the decoder failing.

The first encoder sequence starts and ends in state zero through the use of a $\nu$-bit tail. Interleaving is performed on all the information and tail bits. The second encoder sequence is also made to start in state zero. However, to keep the same decoder architecture for the second MAP decoder, the N bits are not forced to end in state zero. This implies that the final state is unknown. The second MAP decoder takes this into account by initializing all the reverse state metrics to zero.

Figure 18 shows a photograph of the encoder and IAGs. To the left-hand side are seven DIP switches used to programme the code polynomials for both the encoder and decoder. The bottom Xilinx chip performs the encoder function. The middle Xilinx chip is the address counter and multiplexer for the IAG. Figure 19 shows a decoder iteration. The first decoder prototype was implemented using speedwire, which allowed any design corrections to be easily made. This prototype was then implemented on a printed circuit board. From the top, the Xilinx chips are the control logic and address generator, MAP decoder 1, MAP decoder 2, data delay address generator and miscellaneous logic (left) and branch metric calculator (right). The large chips to the left and right of the Xilinx chips are the 1K x 8 dual-port SRAMs used to store the new and old state metrics (SMs). The left chips store the reverse SMs and the right chips store the forward SMs. To the left of the control logic chip are the 64K x 4 SRAMs where we store the branch metrics to be used in calculating the reverse SMs. The four 64K x 4 SRAMs to the right of the control logic chip are used to store the forward SMs to be used in calculating the log likelihood ratio. Between the bottom two Xilinx chips are the 16K x 4 SRAMs for delaying the input and producing the extrinsic information. The 12 SRAMs at the bottom left of the board are for delaying the n 6 bit inputs (six 256K x 1 and six 1M x 1 SRAMs). The four 64K x 4 SRAMs at the bottom right-hand corner perform the deinterleaving and interleaving of the data.

Figure 18. Turbo encoder and interleaver address generator

Figure 19. Turbo decoder iteration

Figure 20 is a photograph of the completed codec. A 6U high 48 cm rack is used which contains, from left to right, the encoder/interface card, turbo/MAP decoder card 1, interleaver address delay card 1, turbo/MAP decoder cards 2-5, interleaver address delay card 2 and turbo/MAP decoder cards 6 and 7. An additional 11 turbo/MAP decoder cards can fit within the rack to give a total of 18 iterations. Turbo/MAP decoder 7 has its switches in the up position, indicating that the decoder is set up for seven iterations. Each iteration can also output the first decoder output, using a 16K x 4 SRAM from the second MAP decoder to reverse the data in time.

Note that our current decoder implementation is not able to automatically synchronize to a received signal. Instead, a signal from the encoder was used to synchronize the decoder. Future modifications may include a synchronization word in the encoded block to allow synchronization.

4.2. Turbo Decoder Performance 

The first tests were made using a rate 1/3 turbo code using identical rate 1/2 16-state codes with polynomials $g_0 = 31$ and $g_1 = 33$ (in octal) (Reference 43). These code polynomials were optimized for rate 1/3 turbo codes and were found to perform better than the codes from Reference 6 (which were optimized for a rate 1/2 turbo code). The value of A was set to 15 for all $E_b/N_0$, which is close to the optimum values. An S = 31 (Reference 44), 65,536 bit interleaver was used (S = 31 implies that any two consecutive bits are separated by at least 31 other bits after interleaving). The actual rate is reduced by 4092/4096 = 1023/1024 owing to the MAP block size being only 4096 bits, with 4 bits used as the tail. The 'inner' code is terminated to state zero, while the 'outer' code is not terminated. Shannon capacity at this rate is at an $E_b/N_0 = -0.55$ dB (the capacity at this rate with QPSK modulation is -0.49 dB).
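
The spreading constraint mentioned above can be illustrated with a small generator and checker. This sketch uses the commonly cited form of the S-random property (any two positions within S of each other are mapped more than S apart) and a simple greedy construction with restarts; whether Reference 44 uses exactly this definition and construction is an assumption, and the function names are hypothetical. A toy size is used rather than the 65,536 bit interleaver of the text.

```python
import random


def s_random_interleaver(n, s, seed=0, max_tries=1000):
    """Greedy construction of a length-n permutation with spread s (s >= 1)."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        pool = list(range(n))
        rng.shuffle(pool)
        perm = []
        while pool:
            for idx, cand in enumerate(pool):
                # Accept a candidate only if it is far from the last s choices.
                if all(abs(cand - prev) > s for prev in perm[-s:]):
                    perm.append(pool.pop(idx))
                    break
            else:
                break   # dead end: no candidate fits, so restart
        if len(perm) == n:
            return perm
    raise RuntimeError("no S-random permutation found; try a smaller s")


def is_s_random(perm, s):
    """Check that entries at most s apart in position differ by more than s."""
    return all(abs(perm[i] - perm[j]) > s
               for i in range(len(perm))
               for j in range(i + 1, min(i + s + 1, len(perm))))


if __name__ == "__main__":
    perm = s_random_interleaver(64, 3)
    print(perm[:8], is_s_random(perm, 3))
```

The greedy search becomes slow or fails when s approaches roughly the square root of n/2, which is why a restart loop with a try limit is included.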

In Figure 21 we plot BER versus $E_b/N_0$ for 6.5 and seven iterations (a half-iteration is the output of the first MAP decoder). We see that BERs of $10^{-5}$ and $10^{-6}$ are achieved at 0.32 and 0.38 dB respectively. These are 0.87 and 0.93 dB away from rate 1/3 capacity for BERs of $10^{-5}$ and $10^{-6}$ respectively. Unfortunately, $10^{-7}$ is close to an $E_b/N_0$ of 1.0 dB owing to the BER flattening above 0.4 dB. We also see that above 0.4 dB the output from 6.5 iterations performs better than that from seven iterations. This may be due to the second MAP decoder being unterminated or perhaps an effect of other non-linearities in the decoder.
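
The capacity gaps quoted above can be cross-checked with the standard Shannon limit for an unconstrained-input, real-valued AWGN channel, $E_b/N_0 \ge (2^{2R} - 1)/(2R)$. That the paper's -0.55 dB figure was obtained from exactly this expression is an assumption, although the numbers agree; the function name is hypothetical.

```python
import math


def shannon_limit_db(rate):
    """Minimum Eb/N0 in dB at code rate `rate` (bits per real channel use)."""
    ebno = (2.0 ** (2.0 * rate) - 1.0) / (2.0 * rate)
    return 10.0 * math.log10(ebno)


if __name__ == "__main__":
    # Rate 1/3 reduced by the 4 tail bits in every 4096 bit MAP block.
    rate = (1.0 / 3.0) * (1023.0 / 1024.0)
    limit = shannon_limit_db(rate)
    print(round(limit, 2))
    for measured in (0.32, 0.38):
        print(measured, "dB is", round(measured - limit, 2), "dB from capacity")
```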

Also note how the performance shallows between 0.35 and 0.4 dB. There appears to be a sudden change in slope. As can be seen, it is not an error floor, since the BER does decrease with increasing $E_b/N_0$. The cause of this sudden change in slope is the small free distance of turbo codes (Reference 45). This free distance has a spectral component much less than one, which greatly reduces its error probability (in fact, this reduction is inversely proportional to $N_I$; Reference 45). However, since the free distance term has a shallow slope, this slope will appear at relatively high $E_b/N_0$.

Figure 20. Turbo codec

Computer simulations for this scheme have not been previously performed. Thus we make a comparison with the rate 1/3, 16-state, 16,384 bit interleaver scheme in Reference 46. With 11 iterations it achieved an $E_b/N_0$ of 0.24 dB at a BER of $10^{-5}$. This is 0.08 dB better than our scheme, which has seven iterations and an interleaver that is four times larger.

The first MAP decoder had mostly double-bit error outputs at low BER. This is predicted in Reference 45, where the use of systematic convolutional codes leads to error sequences of information weight two (corresponding to the error bit causing the sequence to leave the path and the error bit causing the sequence to return to the path). This is very important for turbo codes, since it greatly increases their performance over the use of non-systematic encoders (which have single-bit error patterns) (Reference 45). Also, systematic encoders perform slightly better than non-systematic encoders at low SNR (Reference 42), which is important in iterative decoding.

The output from the second MAP decoder at low BERs was mostly in single-bit errors, indicating that the unterminated states could be causing a problem. Since there are $N_I/N = 16$ MAP blocks in each interleaver block, the effect of all these unterminated states could induce these single-bit errors.

It was found that if the subtractor outputs in Figure 15 were limited to have a range from -128 to +127, the decoder would perform very poorly. When the extrinsic information is added to the received noisy sample (which ranges from -31 to +31), errors could be introduced owing to non-linearities in the design. For example, say we receive +31 and add the extrinsic information +127 to it. The output of the first MAP decoder will be limited to +127. When we subtract the extrinsic information, the information fed into the second MAP decoder will then be zero! The results shown in Figure 21 had the subtractor outputs limited from -64 to +63 to avoid this problem.
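
The worked example above can be reproduced with a toy saturation model. This is a deliberate simplification and not the decoder itself: the MAP decoder output is taken to be just the sum of the received sample and the extrinsic input, limited to 8 bits, exactly as in the example in the text. The function names are hypothetical.

```python
def clamp(x, lo, hi):
    """Limit x to the closed range [lo, hi]."""
    return max(lo, min(hi, x))


def second_decoder_input(received, extrinsic, ext_lo, ext_hi):
    """Value passed on after the extrinsic information is subtracted again."""
    z = clamp(extrinsic, ext_lo, ext_hi)      # limited subtractor output
    llr = clamp(received + z, -128, 127)      # 8 bit MAP decoder output
    return llr - z


if __name__ == "__main__":
    # Full-range extrinsic information wipes out the received sample.
    print(second_decoder_input(31, 127, -128, 127))
    # Limiting the extrinsic information to -64..+63 leaves headroom.
    print(second_decoder_input(31, 127, -64, 63))
```

With the extrinsic range limited to -64 to +63, the largest possible sum is 31 + 63 = 94, which stays inside the 8 bit output range, so the received sample survives the subtraction.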

In Figure 22 we plot the BER against the number of iterations for $E_b/N_0 = 0.0$, 0.1, 0.2, 0.3, 0.4, 0.5 and 1.0 dB. Of interest is how the BER suddenly flattens after quickly decreasing for 0.4, 0.5 and 1.0 dB. Another point of interest is the potential of more iterations. The 0.2 dB curve indicates that further improvement is possible and the 0.1 dB curve might be able to reach low BERs as well.

Figure 23 shows the performance of our rate 1/7 16-state turbo decoder with code polynomials $g_0 = 23$, $g_1 = 35$, $g_2 = 27$ and $g_3 = 37$ from Reference 46. To obtain better performance, the extrinsic information subtractor outputs were limited from -96 to +95. We set A = 7 to avoid limiting the state metrics too greatly. The 64K interleaver from Reference 24 was tried but was found to give inferior performance to our randomly generated interleaver (although it must be noted that this interleaver was designed for a rate 1/2 turbo code).

Values of $E_b/N_0$ of -0.30, -0.27 and -0.19 dB are achieved for BERs of $10^{-5}$, $10^{-6}$ and $10^{-7}$ respectively. Note that the $10^{-7}$ BER is achieved with 6.5 iterations. Shannon capacity at this rate is at -1.12 dB, from which we are 0.82 dB away at a BER of $10^{-5}$. With more iterations we expect to reduce this gap by 0.1-0.2 dB.

5. CONTINUOUS DECODING AND 
SYNCHRONIZATION 

A way of improving the decoding design is to use 
a continuous MAP decoder. A continuous decoder 
does not need a tail to be added to the sequence 
and only needs to synchronize to the n coded sym- 
bols. We will give a description of an efficient 
implementation of a continuous MAP decoder along 
with a synchronization technique that can be used 
for a turbo decoder. 

5.1. Continuous Decoding 

A continuous decoding algorithm for the MAP 
algorithm was first presented for the ISI channel 
in Reference 47 (with a complexity exponentially 
proportional to the decoder delay). A simpler algorithm 
for continuous decoding of convolutional codes 
was first described in Reference 48 and later in 
References 29, 49 and 50. A similar algorithm for 
the ISI channel is given in Reference 51. The n 
received data symbols ($R_k$) are stored from time k 
to k + L. They are then read out in reverse order 
from time k + L - 1 to k and $\beta_k^m$ determined 
recursively (starting with $\beta_{k+L}^m = 1$ for all m). The forward 
SMs $\alpha_k^m$ are also calculated as normal using the $R_k$ 
that have been delayed by 2L. Using $\alpha_k^m$, the branch 
metrics and $\beta_k^m$, $L_k$ is determined as normal. We 
then increment k by one and repeat the whole 
algorithm. By making L sufficiently large, the $\beta_k^m$ 
that is determined will be very close to the true 
$\beta_k^m$, had $\beta_{k+L}^m$ been known precisely. We can see 
that this algorithm is computationally intensive, 
since each $\beta_k^m$ is calculated L times instead of 
only once. 
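
The effect of the window length L can be illustrated with a small numerical sketch. It is not the decoder described here: the branch weights are random positive numbers standing in for real branch metrics, and the names `backward_step`, `reverse_sm`, `window_error` and `demo` are hypothetical. It only shows that a reverse recursion started from all-ones at time k + L approaches the value obtained from a known termination state as L grows.

```python
import random

def backward_step(gamma, beta_next):
    """One reverse-SM update: beta_t[m] = sum_j gamma[m][j] * beta_{t+1}[j].

    The result is normalized to sum to 1 so that values stay bounded.
    """
    beta = [sum(g * b for g, b in zip(row, beta_next)) for row in gamma]
    total = sum(beta)
    return [b / total for b in beta]

def reverse_sm(gammas, k, start, beta_start):
    """Recurse from time `start` (where beta equals beta_start) back to time k."""
    beta = list(beta_start)
    for t in range(start - 1, k - 1, -1):
        beta = backward_step(gammas[t], beta)
    return beta

def window_error(gammas, k, L, beta_end):
    """Largest difference between the windowed and the full-block beta_k."""
    num_states = len(beta_end)
    exact = reverse_sm(gammas, k, len(gammas), beta_end)
    approx = reverse_sm(gammas, k, k + L, [1.0] * num_states)
    return max(abs(a - e) for a, e in zip(approx, exact))

def demo(num_states=4, length=400, k=100, seed=1):
    # Assumption: dense random weights in [0.5, 1.5] replace real branch metrics.
    rng = random.Random(seed)
    gammas = [[[rng.uniform(0.5, 1.5) for _ in range(num_states)]
               for _ in range(num_states)] for _ in range(length)]
    beta_end = [1.0] + [0.0] * (num_states - 1)  # block terminated in state 0
    return {L: window_error(gammas, k, L, beta_end) for L in (1, 4, 16, 64)}

if __name__ == "__main__":
    for L, err in demo().items():
        print(f"L = {L:3d}: max |beta error| = {err:.3e}")
```

Producing one $\beta_k^m$ this way costs L recursion steps, which is the computational burden noted above.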

A computationally simpler algorithm was first 
mentioned in Reference 18 and later in more detail 
in References 52-54. As for the previous algorithm, 
we store $R_k$ into a RAM of size nL. The reverse 
SMs are then calculated (starting with $\beta_{k+L}^m = 1$ for 
all m). However, the reverse SMs continue to be 
calculated from time k to k - L + 1 using the $R_k$ 
from the reversing RAM. After the $R_k$ have been 
delayed by 2L (using another RAM of size 2nL), 
the forward SMs are calculated and stored in a 
RAM of size $2^{\nu}L$. The time-reversed forward SMs 
are combined with the last L reverse SMs to give 
a time-reversed LLR output. We can see that only 
half the reverse SMs that are calculated are used. 
Thus two reverse SM calculators are used in pipeline. 
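
A minimal functional sketch of this block-wise schedule follows. It is a software model only, not the hardware described here: the random branch weights, the function names and the `b % 2` assignment of blocks to two calculators are assumptions made for illustration. Each run covers 2L steps, of which the first L are warm-up and the last L are kept.

```python
import random

def backward_step(gamma, beta_next):
    """beta_t[m] = sum_j gamma[m][j] * beta_{t+1}[j], normalized to sum to 1."""
    beta = [sum(g * b for g, b in zip(row, beta_next)) for row in gamma]
    total = sum(beta)
    return [b / total for b in beta]

def block_reverse_sms(gammas, L):
    """Return (used, computed, schedule) for the block-wise reverse recursion.

    used[t] is the reverse SM vector kept for time t, computed counts every
    reverse-SM update performed, schedule lists (block, calculator) pairs.
    """
    num_states = len(gammas[0])
    num_blocks = len(gammas) // L - 1    # the final block has no look-ahead data
    used, computed, schedule = {}, 0, []
    for b in range(num_blocks):
        beta = [1.0] * num_states        # all-ones start, L symbols past the block
        for t in range((b + 2) * L - 1, b * L - 1, -1):
            beta = backward_step(gammas[t], beta)
            computed += 1
            if t < (b + 1) * L:          # second half of the run: values are kept
                used[t] = beta
        schedule.append((b, b % 2))      # alternate between two calculators
    return used, computed, schedule

def make_gammas(length, num_states=4, seed=7):
    # Assumption: random positive weights stand in for real branch metrics.
    rng = random.Random(seed)
    return [[[rng.uniform(0.5, 1.5) for _ in range(num_states)]
             for _ in range(num_states)] for _ in range(length)]

if __name__ == "__main__":
    L = 16
    used, computed, schedule = block_reverse_sms(make_gammas(8 * L), L)
    print(f"computed {computed} reverse SM vectors, kept {len(used)}")
    print("block -> calculator:", schedule)
```

Because every run lasts 2L symbol periods while a new block arrives every L, two calculators working alternately are enough to keep up, at the cost of discarding half of what they compute.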

Figure 24 shows how $R_k$ is stored and read for 
use by the reverse SMCs. In this case we have 
assumed that L = 64. The first RAM is used to time 
reverse $R_k$ in blocks of L and the second RAM is 
used to delay the reversed $R_k$ by 2L. Two multiplexers 
are then used to alternately select the 
RAM outputs for use by the two RSMCs. 

Using the read/write technique, the total amount 
of RAM that is required is $3L \times nq + L2^{\nu} \times 8$ bits, where 
we have assumed 8 bit SMs and LLR. An additional 
$2^{\nu} \times 8$ bits would be required if the output had to 
be reversed in time. This is still considerably less 
complex than having L RSMCs as in the previous 
algorithm. If the decoder is to be used in a turbo 
decoder, this reversal can be performed within the 
interleaver. The total delay is 4L (with the LLR 
reverser) or 3L (without the LLR reverser). 

5.2. A Synchronization technique for Turbo 
Decoders 

In a turbo decoder an important consideration is 
being able to synchronize to the interleaved data of 
depth $N_I$. One could insert synchronization words, 
but this increases the complexity and decreases the 
bandwidth efficiency. The first MAP decoder needs 
only to synchronize to n possible states. This can 
be done by monitoring the average amplitude of $L_k$. 
When the decoder is out of synchronization, the 
average value of $|L_k|$ will be lower than expected. 
This is due to the decoder not receiving a valid 
code sequence and producing an unreliable 
decoded sequence. 

By accumulating $|L_k|$ for a certain length of time 
and comparing it with a threshold, the decoder 
can make a reliable decision as to whether it is 
synchronized. If the decoder is not synchronized, it 
tries the next state and then waits for a decoder 
delay (in order for the decoder to produce stable 
data) before it starts monitoring $|L_k|$ again. This 
process continues until synchronization is achieved. 
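
The search loop just described can be sketched as follows. This is a toy model under stated assumptions: `read_llrs` is a hypothetical interface that returns decoder LLRs for a candidate state, and the Gaussian LLR statistics and the threshold of 4.0 are invented for illustration and are not measured values.

```python
import random

def mean_abs(values):
    """Average magnitude of a list of LLR samples."""
    return sum(abs(v) for v in values) / len(values)

def acquire_sync(read_llrs, num_states, t_a, t_d, threshold, start_state=0):
    """Step through candidate states until the mean |L_k| exceeds threshold.

    Returns (state, elapsed_symbols), or (None, elapsed_symbols) if every
    state was tried without success.
    """
    state, elapsed = start_state, 0
    for _ in range(num_states):
        window = read_llrs(state, t_a)        # accumulate |L_k| for t_a symbols
        elapsed += t_a
        if mean_abs(window) > threshold:
            return state, elapsed
        state = (state + 1) % num_states      # try the next state
        read_llrs(state, t_d)                 # discard output for one decoder delay
        elapsed += t_d
    return None, elapsed

def make_toy_decoder(true_state, seed=3):
    """Hypothetical LLR source: large magnitudes only in the correct state."""
    rng = random.Random(seed)
    def read_llrs(state, count):
        if state == true_state:
            return [rng.gauss(8.0, 2.0) for _ in range(count)]
        return [rng.gauss(0.0, 2.0) for _ in range(count)]
    return read_llrs

if __name__ == "__main__":
    state, elapsed = acquire_sync(make_toy_decoder(3), num_states=5,
                                  t_a=64, t_d=192, threshold=4.0)
    print(f"synchronized to state {state} after {elapsed} symbols")
```

Each failed attempt costs one averaging period plus one decoder delay, which is the per-attempt cost that enters the synchronization time.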

The average synchronization time ($t_s$) depends on 
the decoder delay ($t_d$), the time to average $|L_k|$ ($t_a$) 
and the number of synchronization states ($n_s$). Since 
we can randomly start in any state (taking on average 
$(n_s - 1)/2$ attempts before synchronization is 
achieved), we have that 

$$t_s = (t_d + t_a)(n_s - 1)/2 \tag{50}$$

The worst-case synchronization time is only twice 
the average synchronization time. For a turbo 
decoder we have 

$$t_s = (t_d + t_a)(n - 1)/2 + (N_I + t_d + t_a)(N_I - 1)/2 \tag{51}$$

The first part of (51) corresponds to the first MAP 
decoder and the second part to the second MAP 
decoder. For large $N_I$ the synchronization time will 
be dominated by the second MAP decoder. 

For example, if $t_d = 3L = 192$, $t_a = 64$, $n = 2$ and 
$N_I$ = 65,536, then $t_s$ = 2,155,839,488 bits! At 
2.048 Mbit/s it would thus take on average about 
17.5 min to synchronize. Once synchronized, one 
should increase $t_a$ so that the synchronization circuit 
does not indicate a false alarm with very high 
probability. 
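
The arithmetic of this example can be checked with a short sketch. The function names are hypothetical. The two-decoder expression follows (51), with the second term taken as $(N_I + t_d + t_a)(N_I - 1)/2$, a form that reproduces the quoted total of 2,155,839,488 bits.

```python
def avg_sync_time(t_d, t_a, n_s):
    """Average synchronization time for one decoder with n_s states, as in (50)."""
    return (t_d + t_a) * (n_s - 1) / 2

def turbo_sync_time(t_d, t_a, n, n_i):
    """Average synchronization time for the turbo decoder, as in (51)."""
    first = avg_sync_time(t_d, t_a, n)             # first MAP decoder
    second = (n_i + t_d + t_a) * (n_i - 1) / 2     # second MAP decoder
    return first + second

if __name__ == "__main__":
    bits = turbo_sync_time(t_d=192, t_a=64, n=2, n_i=65536)
    minutes = bits / 2.048e6 / 60                  # at 2.048 Mbit/s
    print(f"{bits:,.0f} bits, about {minutes:.1f} min")
```

The first decoder contributes only 128 bits of the total, which makes the dominance of the second decoder explicit.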

A way of reducing the overall synchronization 
time is to evenly distribute the synchronization time 
between the two decoders. We could do this by 
adding modulo-2 a random binary sequence of 
length $L_1$ to the parity of the first individual code. 
This forces the inner decoder to synchronize to 
$nL_1$ states. 

We also have that $L_1 L_2 = N_I$ (to keep the overall 
number of synchronization states the same). The 
outer decoder now needs only to synchronize to $L_2$ 
states. The average synchronization time is then 

$$t_s = (t_d + t_a)(nL_1 - 1)/2 + (N_I + t_d + t_a)(L_2 - 1)/2 \tag{52}$$



One can then try various values of $L_1$ and $L_2$ to 
minimize (52). Assuming a real-valued $L_1$, the optimum 
value of $L_1$ is (taking the differential of (52) 
and setting it to zero) 

$$L_1 = \sqrt{\frac{N_I (N_I + t_d + t_a)}{n (t_d + t_a)}} \tag{53}$$
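
The differentiation step can be written out as follows. This is a sketch that assumes $L_2 = N_I / L_1$ is substituted into (52) and that $N_I$ is held fixed while $L_1$ varies.

```latex
\begin{align*}
t_s(L_1) &= \frac{(t_d + t_a)(nL_1 - 1)}{2}
          + \frac{(N_I + t_d + t_a)(N_I/L_1 - 1)}{2}, \\
\frac{dt_s}{dL_1} &= \frac{n(t_d + t_a)}{2}
          - \frac{N_I (N_I + t_d + t_a)}{2L_1^2} = 0, \\
L_1 &= \sqrt{\frac{N_I (N_I + t_d + t_a)}{n(t_d + t_a)}}.
\end{align*}
```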



For our example we thus have $L_1$ = 2901.96 and 
$L_2$ = 22.58. Since $L_1$ and $L_2$ must be integers, we 
let $L_1$ = 2849, $L_2$ = 23 and $N_I = 65,527 = 2^{16} - 9$. 
From (52) we have $t_s$ = 1,452,829 bits, a nearly 1500 
times reduction compared with the original scheme. 
At 2.048 Mbit/s this corresponds to an average 
synchronization time of 0.709 s. 

Alternatively, we could let $L_1$ = 4096 and $L_2$ = 16, 
which only slightly increases $t_s$ to 1,541,888 bits (a 
delay of 0.753 s at 2.048 Mbit/s). The combination 
of $L_1$ = 2048 and $L_2$ = 32 is only marginally slower 
at 1,543,936 bits or 0.754 s. 
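
The three parameter choices above can be reproduced with a short sketch of (52) and (53). The function names are hypothetical, and $N_I$ is taken as the product $L_1 L_2$ in each case.

```python
import math

def split_sync_time(t_d, t_a, n, l1, l2):
    """Average synchronization time with the search split as in (52)."""
    n_i = l1 * l2                                   # interleaver size N_I
    inner = (t_d + t_a) * (n * l1 - 1) / 2          # inner decoder, n*L1 states
    outer = (n_i + t_d + t_a) * (l2 - 1) / 2        # outer decoder, L2 states
    return inner + outer

def optimum_l1(t_d, t_a, n, n_i):
    """Real-valued L1 that minimizes (52), as given by (53)."""
    return math.sqrt(n_i * (n_i + t_d + t_a) / (n * (t_d + t_a)))

if __name__ == "__main__":
    t_d, t_a, n = 192, 64, 2
    l1_opt = optimum_l1(t_d, t_a, n, 65536)
    print(f"optimum L1 = {l1_opt:.2f}, L2 = {65536 / l1_opt:.2f}")
    for l1, l2 in ((2849, 23), (4096, 16), (2048, 32)):
        bits = split_sync_time(t_d, t_a, n, l1, l2)
        print(f"L1 = {l1}, L2 = {l2}: {bits:,.0f} bits, "
              f"{bits / 2.048e6:.3f} s at 2.048 Mbit/s")
```

The power-of-two choices are within about 6% of the near-optimum integer pair.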

6. CONCLUSIONS 

The original turbo coding paper by Berrou et al. (Reference 6) 
presented a coding scheme that came within 0.7 dB 
of Shannon capacity. The MAP decoding algorithm 
that was presented was much too complicated to 
implement practically. We have rederived the MAP 
algorithm to present it in as simple a form as 
possible. By taking the logarithm of the MAP algorithm, 
a realizable implementation could be achieved 
at high data rates. 

The turbo decoder that we have constructed has 
indeed been able to verify the amazing performance 
presented in Reference 6. We were able to come 
within 0.8 dB of capacity with only seven iterations. 
With up to 18 iterations we can expect to reduce 
this amount by 0.1-0.2 dB. Thus we have been able 
to demonstrate that near-Shannon performance can 
be achieved at high data rates. 

We used a block-type MAP algorithm which 
requires a lot of memory for its implementation. An 
efficient implementation of a continuous decoder 
has been presented. Continuous decoders have the 
advantages of being less complex, being easier to 
synchronize, having a much smaller delay and giving 
slightly better performance. 

A synchronization technique for turbo decoders 
which takes advantage of the low delay of continu- 
ous decoders has also been presented. It is shown 
that even for large interleaver sizes, automatic 
synchronization can be achieved in a relatively small 
time interval. 

APPENDIX. GLOSSARY OF ABBREVIATIONS 



ACS add-compare-select 

AGC automatic gain control 

APoP a posteriori probability 

APrP a priori probability 

AWGN additive white Gaussian noise 

BER bit error ratio 

BM branch metric 

BMC branch metric calculator 

BPSK binary phase shift keying 

CE clock enable 

CLB configurable logic block 

CLK clock 

CP clock input 

CTL control 

D-FF data flip-flop 

DCLK data clock 

DEC decoder 

DEINT deinterleaver 

DEL delay 

DIP dual inline package 

DPRAM dual-port random access memory 

EM eccumalator metric 

ENC encoder 

EPROM erasable programmable read-only memory 

FSM forward state metric 

FSMC forward state metric calculator 

I/O input/output 

IAG interleaver address generator 

INT interleaver 

ISI intersymbol interference 

LLR log likelihood ratio 

LLRC log likelihood ratio calculator 

MA multiply-add 

MAP maximum a posteriori 

MB megabyte 

MUX multiplexer 

OE output enable 

PC personal computer 

PM path metric 

QPSK quadrature phase shift keying 

RAM random access memory 

RS Reed-Solomon 

RSM reverse state metric 

RSMC reverse state metric calculator 

SEL select 

SISO soft-in soft-out 

SM state metric 

SMC state metric calculator 

SNR signal-to-noise ratio 

SRAM static random access memory 

SOVA soft-output Viterbi algorithm 

WE write enable 

XNOR exclusive negative OR 